1.0  RESEARCH  SUMMARY


This  year  we  focused  on  unsupervised  learning  and  adaptive 
fuzzy  systems.  Graduate  students  Seong-Gon  Kong  and  Peter  Pacini 
assisted  me  in  simulations  and  paper  preparation.  (USC's  School 
of  Engineering,  not  AFOSR,  financially  supported  Peter  Pacini.) 

In  unsupervised  learning,  we  explored  the  differential 
competitive  learning  law.  This  unsupervised  learning  law  differs 
from  the  standard  competitive  learning  law,  which  modifies  a 
synapse only if its post-synaptic neuron "wins" a competition for
activation or pattern stimulation. The differential competitive
learning (DCL) law modifies the synapse only if its post-synaptic
neuron changes. Standard competitive learning ignores this
instantaneous win-rate information. The plus-or-minus sign of
the neuronal time derivatives endows DCL with many of the coding
properties shared by supervised competitive learning laws, which
reward or punish synapses, according as the post-synaptic neuron
classifies or misclassifies an input pattern, by changing the
sign of a difference term.

We showed that DCL behaves as a type of adaptive delta-modulation
procedure. When sampling highly correlated data, such as speech
data, the pairwise difference of samples provably contains more
information (has less variance) than the samples themselves.
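
The variance claim can be checked with a short derivation. This is a sketch under assumptions that are ours, not the report's: the samples $x_k$ are wide-sense stationary with common variance $\sigma^2$, and $\rho$ denotes the correlation coefficient between adjacent samples.

```latex
\begin{align*}
\operatorname{Var}(x_{k+1} - x_k)
  &= \operatorname{Var}(x_{k+1}) + \operatorname{Var}(x_k)
     - 2\operatorname{Cov}(x_{k+1}, x_k) \\
  &= 2\sigma^2 (1 - \rho), \\
2\sigma^2 (1 - \rho) < \sigma^2
  &\iff \rho > \tfrac{1}{2}.
\end{align*}
```

Under these assumptions the difference signal has smaller variance than the samples exactly when adjacent samples are more than half correlated, which is the regime "highly correlated data" refers to.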

We  tested  DCL  against  both  unsupervised  and  supervised 
competitive  learning  (SCL)  for  centroid  estimation  and  for 
phoneme recognition. DCL consistently outperformed SCL, which
always outperformed unsupervised competitive learning. In the
nonlinear case, DCL synaptic vectors converged to pattern-class
centroids faster than SCL synaptic vectors converged to them. In
the linear case, both types of synaptic vectors converged equally
quickly to centroids. But once the synaptic vectors reached the
centroids, DCL synaptic vectors wandered less about the centroids
than the SCL synaptic vectors wandered. (Since we modeled
learning with stochastic differential equations, stochastic
equilibrium corresponds to Brownian wandering about a fixed
point.) The paper "Differential Competitive Learning for
Centroid  Estimation  and  Phoneme  Recognition"  summarizes  these 
results  and  will  appear  in  the  January  1991  issue  of  the  IEEE 
Transactions  on  Neural  Networks. 


In adaptive fuzzy systems, I developed the general AFAM
(adaptive fuzzy associative memory) methodology and successfully
applied it to two control problems: backing up a truck-and-trailer
rig in a parking lot (a mathematically unsolved control
problem) and realtime target tracking. This research, along with
my pure research on fuzzy set theory, published this summer as
"Fuzziness vs. Probability" in the International Journal of
General Systems, has generated widespread technical and popular
interest. Several popular and technical publications have
featured it. These publications include Electronic Engineering
Times, the Los Angeles Times, Popular Science, the Economist, and
Breakthroughs. As program chairman of the first international
conference on neural network and fuzzy theory in Iizuka, Japan, in
July 1990, I presented these concepts to a wide international
audience in my plenary lecture.

The key research contribution was product space clustering
(PSC). I developed the geometry of fuzzy systems as mappings
between unit hypercubes. (Fuzzy sets define points in unit
hypercubes. Nonfuzzy sets define the 2^n vertices of an
n-dimensional hypercube.) PSC estimates a fuzzy system as a
surface in the system's input-output product state space. We
partition the input-output state space into FAM cells. Each FAM
cell defines a fuzzy "rule" or general association of fuzzy
output descriptions with fuzzy input descriptions: If the
traffic is HEAVY in one direction, then keep the light green
LONGER in that direction. (HEAVY, LONGER) defines a FAM rule in
the input-output state space as a region or mini-Cartesian product.
PSC estimates these regions with unsupervised learning.
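
The sets-as-points geometry can be sketched in a few lines. This is an illustration only: the function names are hypothetical, and it assumes a fuzzy set over n elements is stored as a tuple of n membership values in [0, 1].

```python
from itertools import product


def vertices(n):
    """All 2**n nonfuzzy subsets of an n-element universe, as 0-1 tuples."""
    return list(product((0, 1), repeat=n))


def is_fit_vector(point):
    """True if every membership value lies in [0, 1] (point is in the unit cube)."""
    return all(0.0 <= v <= 1.0 for v in point)


def is_nonfuzzy(point):
    """True if the point is a cube vertex (every membership value is 0 or 1)."""
    return all(v in (0, 1) for v in point)


if __name__ == "__main__":
    print(len(vertices(3)))              # number of vertices of the 3-cube
    print(is_nonfuzzy((0.3, 1.0, 0.0)))  # an interior or edge point is fuzzy
```

A fuzzy set such as (0.3, 1.0, 0.0) lies inside the cube, while each ordinary subset sits at one of its corners.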

I  showed  how  to  use  differential  competitive  learning  to 
adaptively,  quickly,  and  reliably  generate  banks  of  structured 
fuzzy  rules  given  only  the  input-output  training  data  generated 
by  the  physical  process,  the  human  controller,  or,  in  general, 
the system we wish to estimate. We can present the same
input-output data to a neural network. The neural network may or may
not accurately estimate the underlying unknown function (or joint
probability distribution). But it can generate only a black-box
or "model-free" estimate. Stand-alone neural networks do not
generate structured rules.

We  benchmarked  neural  and  fuzzy  systems  for  backing  up  a  truck, 
and  also  a  truck-and-trailer,  in  a  parking  lot.  Both  systems 
controlled  the  truck  and  truck-and-trailer  successfully.  But  the 
fuzzy  system  was  orders  of  magnitude  easier  to  construct  and,  in 
the  adaptive  case,  train.  We  could  also  modify  the  fuzzy  system 
directly by manipulating its bank of FAM rules. We tested the
fuzzy system's robustness by removing random subsets of FAM rules
and by deliberately replacing important rules with destructive or
"sabotage" rules. In general we found that fuzzy system
performance degrades significantly only if we remove over 50% of
the FAM rules. We also showed how to convert every neural
network system to a structured fuzzy system, complete with a bank
of FAM rules, that approximates the underlying neural system. We
demonstrated this for both the backpropagation truck and
truck-and-trailer systems. The generated fuzzy systems performed
comparably to the original fuzzy systems.

We benchmarked an adaptive fuzzy system for realtime target
tracking against an "optimal" linear Kalman-filter control
system. Again both systems controlled the process well. The
fuzzy system gave finer control, involved far less computation,
required no assumption of how control outputs mathematically
depended on control inputs, and proved robust when we removed FAM
rules or replaced them with sabotage rules. The Kalman-filter
controller proved sensitive to the variance of the
unmodeled-effects parameter.


2.0  PUBLICATIONS


The  above  research  led  to  several  technical  papers.  We 
published  some  in  proceedings  and  some  in  journals.  Others  await 
appearance  or  remain  in  the  review  process.  This  report  includes 
copies  of  papers. 


2.1  Journal  Papers 


1. Kosko, B., "Unsupervised Learning in Noise," IEEE
Transactions on Neural Networks, vol. 1, no. 1, 44 - 57, March
1990.

2. Kosko, B., "Structural Stability of Unsupervised Learning in
Feedback Networks," IEEE Transactions on Automatic Control, in
press, 1990.

3. Kosko, B., "Fuzziness vs. Probability," International Journal
of General Systems, vol. 17, no. 2, 211 - 240, 1990.

4. Kosko, B., "Differential Competitive Learning for Centroid
Estimation and Phoneme Recognition," with S.G. Kong, IEEE
Transactions on Neural Networks, to appear, January 1991.

5. Kosko, B., "Stochastic Competitive Learning," in review at
IEEE Transactions on Neural Networks, 1990.

6. Kosko, B., "Adaptive Fuzzy Systems for Backing Up a
Truck-and-Trailer," with S.G. Kong, in review at IEEE Transactions on
Neural Networks, 1990.

7. Kosko, B., "Adaptive Fuzzy System for Target Tracking," with
P.J. Pacini, in review at IEEE Transactions on Automatic Control,
1990.

8. Kosko, B., "Fuzzy Associative Memories," to be submitted to
Neural Networks, 1990.


2.2  Conference  Proceedings  Papers 


1. Kosko, B., "Stochastic Competitive Learning," Proc. IJCNN-90,
vol. II, 215 - 226, June 1990.

2. Kosko, B., "Comparison of Fuzzy and Neural Truck Backer-Upper
Control Systems," with S.G. Kong, Proc. IJCNN-90, vol. III,
349 - 358, June 1990.

3. "Differential Competitive Learning for Centroid Estimation
and Phoneme Recognition," Proc. European Conference on Neural
Networks, Prague, Czechoslovakia, September 1990.


3.0  NEXT-YEAR  RESEARCH  OBJECTIVES


In  the  third  year  of  this  research  program  we  will  continue  to 
explore  the  relationship  between  unsupervised  neural  network 
systems  and  fuzzy  systems. 

We  need  to  explore  both  the  feedback  dynamics  of  differential 
competitive  learning  and  the  feedforward  encoding  structure  of 
pulse-coded DCL synapses. We have not fully exploited the
delta-modulation nature of DCL. The delta-modulation structure of DCL
suggests that digital DCL systems may provide viable highspeed
communication devices.

We  need  to  explore  the  adaptive  fuzzy  methodology  outside  the 
domain  of  control.  Two  promising  areas  are  image/signal 
processing  and  communication  theory.  The  AFAM  methodology  may 
allow  us  to  estimate  image-compression  schemes  without  detailed 
eigenvector  math  models.  Communication  systems  depend  on  local 
highspeed  decisions  made  with  uncertain,  usually  probabilistic, 
information. Adaptive fuzzy systems may allow us to introduce
"intelligent communication" at the modulation,
spreading/despreading, or encoding/decoding level.
Unsupervised Learning in Noise

BART KOSKO, MEMBER, IEEE


Abstract: The structural stability of real-time unsupervised
learning in feedback dynamical systems is demonstrated with the
stochastic calculus. Structural stability allows globally stable
feedback systems to be perturbed without changing their qualitative
equilibrium behavior. These stochastic dynamical systems are called
random adaptive bidirectional associative memory (RABAM) models,
which include several popular nonadaptive and adaptive feedback
models, such as the Hopfield circuit and the ART-2 model. RABAM
networks can adapt with different stable unsupervised learning laws.
These include the signal Hebb, competitive, and differential Hebb
laws. A new hybrid learning law, the differential competitive law,
which uses the neuronal signal velocity as a local unsupervised
reinforcement mechanism, is introduced and its coding and stability
behavior in feedforward and feedback networks is examined. This
analysis is facilitated by the recent Gluck-Parker pulse-coding
interpretation of signal functions in differential Hebbian learning
systems. The second-order behavior of RABAM Brownian-diffusion
systems is summarized by the RABAM noise suppression theorem: The
mean-squared activation and synaptic velocities decrease
exponentially quickly to their lower bounds, the instantaneous noise
variances driving the system. This result is extended to the RABAM
annealing model, which provides a unified framework from which to
analyze Geman-Hwang combinatorial optimization dynamical systems and
continuous Boltzmann machine learning.

I.  Structural  Stability  in  Hardware,  Biology, 
and  Manifolds 

HOW robust are unsupervised learning systems? What happens if
real-time synaptic mechanisms are perturbed in real time? Will
shaking disturb or prevent equilibria? What effect will thermal
noise processes, electromagnetic interactions, and component
malfunctions have on large-scale implementations of unsupervised
neural networks? How biologically accurate are unsupervised neural
models that do not model the myriad electrochemical, molecular, and
other processes found at synaptic junctions and membrane potential
sites?

These questions are different ways of asking a more general
question: is unsupervised learning structurally stable? Structural
stability [9], [42] allows globally stable feedback systems to be
perturbed without changing their qualitative equilibrium behavior.
This increases the reliability of large-scale hardware
implementations of such networks. It also increases their
biological plausibility, since the myriad synaptic and neuronal
processes missing from neural network models now are modeled, but
as net random unmodeled effects that do not affect the structure of
the global network computations.

Manuscript received April 10, 1989; revised October 9, 1989. This
work was supported by the Air Force Office of Scientific Research
(AFOSR-88-0236). An earlier version of this paper was presented at
the 1989 International Joint Conference on Neural Networks
(IJCNN-89), Washington, DC, June 18-22, 1989.

The author is with the Department of Electrical Engineering,
Systems, Signal, and Image Processing Institute, University of
Southern California, Los Angeles, CA 90089-0272.

Structural stability differs from the global stability, or
convergence to fixed points, that endows some feedback networks
with content-addressable memory and other computational
properties. Globally stable systems can be sensitive to initial
conditions. Different input states can converge to different limit
states; else memory capacity is trivial. Structural stability is
insensitivity to small perturbations. Such perturbation preserves
qualitative properties. In particular, basins of attraction
maintain their basic shape. In some intuitive sense, chaos [36] is
the antithesis of structural stability, or, more accurately, of
structurally stable fixed-point attractors (since chaotic
attractors can be structurally stable).

The formal approach to structural stability uses the
transversality techniques of differential topology [17], the study
of global properties of differentiable manifolds. Manifolds A and
B have nonempty transversal intersection in $R^n$ if the tangent
spaces of A and B span $R^n$ at every point of intersection, that
is, if locally the intersection looks like $R^m$. Two lines
intersect transversely in the plane but not in 3-space, 4-space, or
higher n-space. If the lines are shaken in 2-space, they still
intersect. If shaken in 3-space, the lines may no longer intersect.
In Fig. 1, manifolds A and B intersect transversely in the plane at
points a and b. Manifolds B and C do not intersect transversely at
c.

An indirect approach to structural stability uses the calculus of
stochastic differential and integral equations [35], [41]. This is
the approach used in this paper. The stochastic-calculus approach
abstracts statistically relevant behavior from large sets of
functions. The differential-topological approach, in contrast, is
concerned with all possible behavior of all functions (open dense
sets of functions). This makes the analysis extremely abstract and
calculations cumbersome and often impractical.

The stochastic calculus is difficult to work with as well, but
usually less difficult than transversality techniques. The new
complexity that arises in passing from systems of differential
equations to systems of stochastic differential equations is due to
the nature of solution points. In algebraic equations, such as
$2x + 3 = 4x$, points in the solution space are numbers. Solutions
to differential equations are functions. Solutions to stochastic
differential equations are random processes [41].
Fig. 1. Manifold intersection in the plane (manifold $R^2$).
Intersection points a and b are transversal. Point c is not:
manifolds B and C need not intersect if even slightly perturbed. No
points are transversal in 3-space unless B is a sphere.

Below we demonstrate the structural stability of many types of
unsupervised learning in the stochastic sense. The key idea is to
use the scalar-valued Lyapunov function of globally stable feedback
networks but in a random framework. Then the old Lyapunov function
is a random variable at each moment of time t, so it cannot be
minimized as when it was a scalar at each t. The trick is to
minimize its expectation, its average value, which is a scalar at t.

II.  Four  Unsupervised  Associative  Learning  Laws
The distinction between supervised and unsupervised learning
depends on information. In pattern-recognition theory, for
instance, the distinction is in terms of knowledge of class
boundaries. Pattern recognition is supervised if the training
algorithm requires knowing the class membership of the training
samples, unsupervised if it does not require it.

A similar distinction holds in neural networks. Supervised
learning invariably refers to deliberate gradient descent in the
space of all possible synaptic values. Class membership information
is needed to compute the numerical error vector or error signal
that guides the gradient descent.

Unsupervised learning usually refers to the modification of
biological synapses with physically local signal information. Class
membership information of training samples is not needed. These
systems adaptively cluster patterns into classes by, for example,
evolving "winning" neurons in a competition for activation, or by
evolving different basins of attraction in the state space. We
shall restrict our attention to such biologically motivated
learning methods, knowing that other types of unsupervised learning
are possible and may be of practical engineering value.

Unsupervised learning laws are first-order differential equations
that describe how synapses evolve in time with locally available
information. This information usually involves synaptic properties
or neuronal signal properties. In principle, and in mammalian
brains or optoelectronic integrated circuits, other types of
information may be locally available for computation: glial cells,
specific and nonspecific hormones, background electromagnetic
effects, or light pulses. These phenomena are modeled below as net
random parameters. For the moment they will be ignored. Locality
allows asynchronous synapses to operate in real time.
Mathematically, it also greatly shrinks the function space of
possible unsupervised learning laws.
Associativity further shrinks the function space. Globally,
neural networks associate patterns with patterns. They estimate
continuous functions. Locally, synapses are required to associate
signals with signals. This leads to conjunctive, or multiplicative,
learning laws constrained by locality. This in turn leads to at
least three types of learning laws and a new hybrid law.

The four unsupervised associative learning laws discussed in this
section are 1) the signal Hebb learning law, 2) the competitive
learning law, 3) the differential Hebb learning law, and 4) the new
hybrid law, the differential competitive learning law.

A.  Signal  Hebbian  Learning

The signal Hebb learning law correlates neuronal signals, not
activations:

$$\dot{m}_{ij} = -m_{ij} + S_i^x(x_i)\, S_j^y(y_j) \qquad (1)$$

where the overdot indicates time differentiation, $m_{ij}$ is the
synaptic efficacy of the directed axonal edge from the ith neuron
in field $F_X$ to the jth neuron in field $F_Y$, $x_i$ and $y_j$
are the respective real-valued activations or membrane potentials
of the connected neurons, and $S_i^x$ and $S_j^y$, hereafter
abbreviated to $S_i$ and $S_j$, are the bounded
monotone-nondecreasing signal functions of the connected neurons
that transduce their time-averaged potential differences into
time-averaged frequencies of pulse trains, and where, as in all
equations in this paper, scaling constants can be multiplied or
added where desired. The logistic signal function
$S(x) = (1 + e^{-cx})^{-1}$, with $c > 0$, remains the most popular
signal function for simulations and applications. The logistic
signal function is also strictly monotone increasing, since
$S' = dS(x)/dx = cS(1 - S) > 0$. Strict monotonicity strengthens
stability results.
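
The logistic signal function and its slope identity can be checked numerically. The following is a minimal sketch; the function names and the finite-difference step are illustrative choices, not part of the paper.

```python
import math


def logistic(x, c=1.0):
    """Logistic signal function S(x) = 1 / (1 + exp(-c x)), with c > 0."""
    return 1.0 / (1.0 + math.exp(-c * x))


def logistic_slope(x, c=1.0):
    """Closed-form derivative S'(x) = c S (1 - S)."""
    s = logistic(x, c)
    return c * s * (1.0 - s)


def numeric_slope(x, c=1.0, h=1e-6):
    """Central-difference estimate of dS/dx, used only as a check."""
    return (logistic(x + h, c) - logistic(x - h, c)) / (2.0 * h)


if __name__ == "__main__":
    for x in (-3.0, 0.0, 2.5):
        print(x, logistic_slope(x, c=2.0), numeric_slope(x, c=2.0))
```

The two slope columns should agree closely, and the slope stays positive for every x, which is the strict monotonicity used in the stability arguments.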

The solution to (1) is an integral equation since in general
$x_i$ and $y_j$ depend on $m_{ij}$. The key component of this
integral equation is an exponentially weighted average of sampled
patterns:

$$m_{ij}(t) = m_{ij}(0)\, e^{-t} + \int_0^t S_i(s)\, S_j(s)\, e^{s-t}\, ds. \qquad (2)$$

The exponential weight is inherent in the first-order structure of
(1). It produces a recency effect on memory, as in our everyday
exponential decrease in retained information. This well-known
recency effect is the thrust of philosopher David Hume's quote:
"The liveliest thought still is inferior to the dullest sensation."
Nothing is more vivid than now.
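
The link between the law (1) and the exponentially weighted average (2) can be checked numerically. This sketch makes a simplifying assumption that the paper does not: the signal product $S_i S_j$ is a prescribed function of time (the hypothetical `hebb_product` below) rather than depending on $m_{ij}$ through network feedback.

```python
import math


def hebb_product(s):
    """Hypothetical signal product S_i(s) * S_j(s); any bounded function works."""
    return 0.5 * (1.0 + math.sin(s))


def integrate_law(m0, t_end, dt=1e-3):
    """Euler-integrate dm/dt = -m + S_i S_j from time 0 to t_end."""
    m = m0
    for k in range(int(round(t_end / dt))):
        m += dt * (-m + hebb_product(k * dt))
    return m


def closed_form(m0, t_end, dt=1e-3):
    """Evaluate m(0) e^{-t} + int_0^t S_i S_j e^{s-t} ds by the trapezoid rule."""
    total = 0.0
    for k in range(int(round(t_end / dt))):
        a, b = k * dt, (k + 1) * dt
        fa = hebb_product(a) * math.exp(a - t_end)
        fb = hebb_product(b) * math.exp(b - t_end)
        total += 0.5 * dt * (fa + fb)
    return m0 * math.exp(-t_end) + total


if __name__ == "__main__":
    print(integrate_law(2.0, 5.0), closed_form(2.0, 5.0))
```

The two printed values should agree to within the Euler discretization error. The initial condition contributes only through the factor $e^{-t}$, which is the recency effect discussed above.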

B.  Competitive  Learning

The competitive learning law is obtained from (1) if the passive
decay term is modulated by the appropriate local signal:

$$\dot{m}_{ij} = S_j(y_j)\,[S_i(x_i) - m_{ij}]. \qquad (3)$$

The "competitiveness" in (3) is indirect. The assumption is that
neurons compete for activation in the field $F_Y$ in the sense that
the symmetric (distance-dependent) intrafield connections of $F_Y$
are laterally inhibitive: the square symmetric matrix Q of
intrafield connections is positive main-diagonal and nonpositive
off-diagonal, or, more generally, Q has nonnegative blocks on its
main diagonal and nonpositive blocks elsewhere. Then $S_j$ is a
win-loss index of the jth $F_Y$ neuron's performance. In practice
[39] $S_j$ is invariably a 0-1 threshold function or steep logistic
function, which behaves as a threshold function. Then (3) says
learn only if win. If the jth unit wins, the signal pattern
$S(X) = (S_1(x_1), \ldots, S_n(x_n))$ generated at $F_X$ is encoded
as the jth column of the n-by-p connection matrix M exponentially
quickly. This "grandmother synapse" effect differs from Hebbian
learning, where pattern information is superimposed on all of M.
Then every synapse participates in learning new patterns while,
unfortunately, forgetting learned patterns.

Both (1) and (2) were studied as early as the 1960's by Grossberg
[12]. Kohonen [24] and Hecht-Nielsen [15] use the competitive law
(3) statistically for unsupervised clustering in their respective
self-organizing map and counterpropagation networks. The p columns
of M then tend toward the centroids of the sampled p decision
classes, even though the underlying probability density functions
are unknown.
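
The centroid-seeking behavior can be illustrated with a discrete winner-take-all version of the competitive law. This is a sketch under stated assumptions, not the cited authors' implementations: the learning rate 1/(1 + wins), the two Gaussian clusters, and the starting reference vectors are all hypothetical choices made for the illustration.

```python
import random


def competitive_centroids(samples, prototypes):
    """Winner-take-all competitive learning with learning rate 1/(1 + wins)."""
    protos = [list(p) for p in prototypes]
    wins = [0] * len(protos)
    for x in samples:
        # Competition: the closest reference vector wins.
        j = min(range(len(protos)),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(x, protos[i])))
        wins[j] += 1
        c = 1.0 / (1.0 + wins[j])
        # Learn only if win: move the winner toward the sample.
        protos[j] = [m + c * (a - m) for a, m in zip(x, protos[j])]
    return protos


def make_samples(n=2000, seed=0):
    """Draw n points from two hypothetical clusters centered at (0,0) and (5,5)."""
    rng = random.Random(seed)
    centers = [(0.0, 0.0), (5.0, 5.0)]
    out = []
    for _ in range(n):
        cx, cy = rng.choice(centers)
        out.append((rng.gauss(cx, 0.5), rng.gauss(cy, 0.5)))
    return out


if __name__ == "__main__":
    print(competitive_centroids(make_samples(), [(1.0, 1.0), (4.0, 4.0)]))
```

No class labels are used, yet each reference vector ends near the mean of the cluster it captures, since with this learning rate it is a running average of its starting point and the samples it has won.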

C.  Differential  Hebbian  Learning 

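The differential Hebb law [25]-[27], [32], [33], and its variants,
correlates signal velocities as well as signals:

$$\dot{m}_{ij} = -m_{ij} + S_i S_j + \dot{S}_i \dot{S}_j \qquad (4)$$

where, by the chain rule,

$$\dot{S}_i = \frac{dS_i(x_i)}{dt} = \frac{dS_i}{dx_i}\,\frac{dx_i}{dt} = S_i'\, \dot{x}_i.$$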

If signals are locally available to synapses, so are signal
velocities, at least implicitly. Since the signal function $S_i$ is
an abstraction of time-averaged spiking frequencies, $S_i$ is often
assumed nonnegative. Then Hebbian synapses (1) can only grow in
time. Signal velocities, of course, can be both positive and
negative. Correlated (lagged) signals provide a local "arrow of
time" that synapses can exploit [33] to encode time-varying
patterns as limit cycles. Klopf [21]-[23] independently arrived at
a similar discrete (difference) version of (4) in his
drive-reinforcement theory of animal learning.

Recently Gluck and Parker [10], [11] showed that differential
Hebbian learning becomes significantly more plausible in nervous
systems if we recall that real neurons transmit discrete
pulse-coded information and we structure the signal functions $S_i$
and $S_j$ accordingly. Suppose $x_i$ and $y_j$ are pulse functions:
$x_i(t) = 1$ if a pulse occurs at time t, 0 if not, and similarly
for $y_j(t)$. Then the signal frequencies $S_i$ and $S_j$ can be
estimated as exponentially weighted time averages:

$$S_i(t) = \int_{-\infty}^{t} x_i(s)\, e^{s-t}\, ds \qquad (5)$$

$$S_j(t) = \int_{-\infty}^{t} y_j(s)\, e^{s-t}\, ds. \qquad (6)$$

By recalling the form of the solution to a linear inhomogeneous,
first-order differential equation, the signal velocities are seen
to be simple, locally available, differences:

$$\dot{S}_i(t) = x_i(t) - S_i(t) \qquad (7)$$

$$\dot{S}_j(t) = y_j(t) - S_j(t). \qquad (8)$$

Thus  a  signal  velocity  has  the  form  of  a  reinforcement 
signal:  a  pulse  less  the  current  expected  frequency  of 
pulses.  As  Gluck  and  Parker  observe,  not  only  are  these 
differences  locally  available,  they  can  be  computed  in  real 
time  without  unstable  differencing  techniques. 
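
The reinforcement form of the signal velocity can be checked numerically. The sketch below assumes a hypothetical pulse function with finite-width pulses and no pulses before time 0, evaluates the exponentially weighted average by a midpoint rule, and compares a finite-difference slope with the difference "pulse minus expected frequency".

```python
import math


def pulse(s):
    """Hypothetical 0-1 pulse function: on during [1, 2) and [3, 3.5)."""
    return 1.0 if (1.0 <= s < 2.0) or (3.0 <= s < 3.5) else 0.0


def signal(t, dt=1e-3):
    """S(t) = int_{-inf}^{t} x(s) e^{s-t} ds by the midpoint rule (x = 0 for s < 0)."""
    steps = int(round(t / dt))
    return sum(pulse((k + 0.5) * dt) * math.exp((k + 0.5) * dt - t) * dt
               for k in range(steps))


def velocity_gap(t, h=1e-2):
    """Finite-difference dS/dt minus the reinforcement form x(t) - S(t)."""
    slope = (signal(t + h) - signal(t - h)) / (2.0 * h)
    return slope - (pulse(t) - signal(t))


if __name__ == "__main__":
    for t in (1.5, 2.5, 3.25):
        print(t, signal(t), velocity_gap(t))
```

Away from the pulse edges the gap should be close to zero, both while a pulse is on and while it is off.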

For stability purposes, we note another consequence of pulse-coded
signal functions. They show how Hebbian learning can be a special
case of differential Hebbian learning. Suppose the Hebb product
$S_i S_j$ in (4) is scaled down to zero:

$$\dot{m}_{ij} = -m_{ij} + \dot{S}_i \dot{S}_j. \qquad (9)$$

This is the "classical" differential Hebb law [25]-[27]. Then
substituting (7) and (8) into (9) gives

$$\dot{m}_{ij} = -m_{ij} + S_i S_j + [x_i y_j - x_i S_j - y_j S_i] \qquad (10)$$

which is equivalent to the signal Hebb law (1) if and only if the
bracketed term is zero. Thus the simple differential Hebb law (9),
and of course (4) suitably scaled, reduces to the signal Hebb law
when no pulses occur, when $x_i(t) = y_j(t) = 0$. This happens
frequently. For, in any connected time interval, the set of times
where pulses occur, $\{t : x_i(t) = 1\}$, has Lebesgue measure
zero. (Consider pulses at rational time points or at Cantor set
points.) This interpretation, though, would imply [38] by (5) and
(6) that $S_i = S_j = 0$ almost everywhere, so the integrals in (5)
and (6) would have to be replaced with discrete sums (using
point-mass measures).

The infrequency of unit pulses occurs while the synapse
continually modifies its behavior. When instantaneous pulse
information is not available, the synapse "fills in" with expected
pulse frequencies, and hence Hebbian learning. Since signal Hebbian
learning is unconditionally stable (the ABAM theorem, reviewed
below) in many nonlinear dynamical systems, including popular
feedback neural networks, pulse-coded differential Hebbian
dynamical systems may be stable over a wider range of system
parameters than earlier velocity-acceleration stability assumptions
[32], [33] suggested.

D.  Differential  Competitive  Learning 

The fourth unsupervised learning law is a new hybrid learning law,
the differential competitive law:

$$\dot{m}_{ij} = \dot{S}_j(y_j)\,[S_i(x_i) - m_{ij}]. \qquad (11)$$

The idea is learn only if change. As with the competitive learning
law (3), the neurons in $F_Y$ compete for activation, and the
nonnegative signal functions $S_j$ keep score. The signal velocity
$\dot{S}_j$ in (11) is a local reinforcement mechanism. Its sign
indicates whether the jth neuron is winning or losing, and its
magnitude measures by how much. The coding and dynamical behavior
of (11) can be analyzed with the pulse-coding interpretation [10],
[11] of signal functions and by comparison with Kohonen's recent
"supervised" adaptive-vector-quantization algorithm [24].

The pulse-coded differential competitive learning law is the
difference of nondifferential competitive laws:

$$\dot{m}_{ij} = (y_j - S_j)\,[S_i - m_{ij}] \qquad (12)$$

$$\phantom{\dot{m}_{ij}} = y_j\,[S_i - m_{ij}] - S_j\,[S_i - m_{ij}] \qquad (13)$$

where $x_i$ is a 0-1 pulse function. Hence the standard
competitive learning law (3) is recovered when $y_j = 1$ and
$S_j = 0$. This occurs when the jth unit has just won the
competition for activation within $F_Y$.

Usually in a competition there are many more losers than winners.
So suppose the jth neuron in $F_Y$ is a loser at time t. Then
$y_j(t) = 0$ holds and has held over some, perhaps short, past
interval $[t', t)$. Then $S_j(t) = 0$ (or nearly 0) by the
exponential-weight structure of (6). So no change, no learning.

Now suppose the jth unit wins in the next instant t. Then
$y_j = 1$ over some interval $[t, t'')$ of nonzero Lebesgue
measure. During this interval the exponential-weight structure of
$S_j$ soon drives $S_j$ toward 1, which we take as the upper bound
of $S_j$. This means $m_{ij}$ quickly approaches a positively
scaled version of the signal $S_i$.

Now suppose the jth unit goes from winning to losing. Then at
first $y_j = 0$ and $S_j = 1$. As $S_j$ quickly falls to zero,
learning slows then stops when $y_j = S_j = 0$. Meanwhile $m_{ij}$
has "moved away" from the signal $S_i$. The signal velocity
$\dot{S}_j$ has "punished" the jth unit.
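
The lose, win, lose sequence just described can be simulated for a single synapse. This is a minimal sketch under assumptions of our own: the input signal $S_i$ is held constant, the pulse function $y_j$ is piecewise constant over three hypothetical phases of equal length, and the equations are integrated with a simple Euler step.

```python
def simulate_dcl(s_i=0.8, m0=0.2, dt=1e-3):
    """Euler-integrate pulse-coded DCL for one synapse through three phases.

    dS_j/dt  = y_j - S_j                 (signal velocity)
    dm_ij/dt = (y_j - S_j)(S_i - m_ij)   (pulse-coded DCL)

    Returns |S_i - m_ij| at the start and after each phase.
    """
    phases = [(0.0, 2.0), (1.0, 2.0), (0.0, 2.0)]  # (y_j value, duration)
    s_j, m = 0.0, m0
    history = [abs(s_i - m)]
    for y_j, duration in phases:
        for _ in range(int(round(duration / dt))):
            velocity = y_j - s_j
            m += dt * velocity * (s_i - m)
            s_j += dt * velocity
        history.append(abs(s_i - m))
    return history


if __name__ == "__main__":
    print(simulate_dcl())
```

In this sketch the distance between the synapse and the signal is unchanged while the unit keeps losing, shrinks while it wins, and grows again once it starts to lose, matching the reward and punishment roles of the signal velocity.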

Kohonen [24] uses a sign change to punish misclassifying prototype
vectors trained with the competitive learning law in his
feedforward "supervised" adaptive vector quantization (AVQ) system.
In vector formulation, the p reference vectors
$m_1(t), \ldots, m_p(t)$ are the respective prototypes at time t of
the p decision classes $D_1, \ldots, D_p$ that partition the signal
space $R^n$. The p reference vectors are also the p columns of the
synaptic matrix M: $m_j = (m_{1j}, \ldots, m_{nj})$ is the fan-in
of synapses of the jth neuron in $F_Y$. All $F_Y$ neurons are
engaged in winner-take-all competition. Given a random training
sample vector $x(t)$ presented at $F_X$, the $F_Y$ competition is
summarized by finding the reference vector $m_j(t)$ closest to
$x(t)$ in Euclidean distance:
$\|x - m_j\| = \min \{\|x - m_i\| : i = 1, \ldots, p\}$.
"Supervision" means we know which decision class the random vector
x was chosen from. If x belongs to $D_j$, the class represented by
$m_j$, then $m_j$ is rewarded by moving $m_j$ a little closer to x.
This allows $m_j$ to gradually approximate the centroid of $D_j$.
(The centroid, or conditional expectation, minimizes the
mean-squared error of vector quantization [37].) Else if x does not
belong to $D_j$, $m_j$ is punished for misclassifying x as a $D_j$
pattern by moving $m_j$ a little farther away from x, presumably
out of regions of misclassification. This is achieved by a simple
sign change:

$$m_j(t+1) = m_j(t) + c(t)\,[x(t) - m_j(t)], \quad x \in D_j \qquad (14)$$

$$m_j(t+1) = m_j(t) - c(t)\,[x(t) - m_j(t)], \quad x \notin D_j \qquad (15)$$

$$m_i(t+1) = m_i(t) \quad \text{for all losing neurons in } F_Y \qquad (16)$$


where $c(0), c(1), c(2), \ldots$ is a slowly decreasing sequence of
small ($c(0) < 1$) learning constants. Kohonen's "unsupervised" AVQ
algorithm eliminates the punishment equation (15) and relaxes (14) by
allowing $x(t)$ to belong to any decision class. The unsupervised
algorithm is clearly a discrete stochastic version of the competitive
law (3) in vector notation. Kohonen shows that under appropriate
statistical conditions, the equilibrium condition of the AVQ
unsupervised-clustering algorithm occurs when the $p$ reference vectors
asymptotically arrive at the centroids of their respective decision
classes. Kohonen next shows that the equilibrium condition of the
supervised AVQ algorithm is similar in structure to that of the optimum
unit-cost Bayes classifier, and cites simulation data in support of
this similarity.
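The update rules (14)-(16) translate almost directly into code. The following is a minimal sketch, not Kohonen's implementation: the two Gaussian clusters, the initial reference vectors, and the decay schedule for $c(t)$ are all hypothetical choices made for illustration.

```python
import random


def nearest(refs, x):
    """Index of the reference vector closest to x in Euclidean distance."""
    return min(range(len(refs)),
               key=lambda j: sum((xi - mi) ** 2 for xi, mi in zip(x, refs[j])))


def avq_step(refs, ref_class, x, x_class, c, supervised=True):
    """Apply (14)-(16) once: only the winning reference vector moves."""
    j = nearest(refs, x)
    sign = 1.0                      # reward (14)
    if supervised and ref_class[j] != x_class:
        sign = -1.0                 # punishment (15)
    refs[j] = [mi + sign * c * (xi - mi) for xi, mi in zip(x, refs[j])]
    return j                        # losers are left unchanged (16)


def train(samples, refs, ref_class, c0=0.1, supervised=True):
    """Run AVQ over labeled samples with a slowly decreasing c(t)."""
    for t, (x, label) in enumerate(samples):
        c = c0 / (1.0 + 0.01 * t)   # hypothetical schedule with c(0) < 1
        avq_step(refs, ref_class, x, label, c, supervised)
    return refs


# Two well-separated clusters; class k is drawn around centers[k].
rng = random.Random(0)
centers = {0: (0.0, 0.0), 1: (4.0, 4.0)}
samples = []
for _ in range(2000):
    label = rng.randrange(2)
    cx, cy = centers[label]
    samples.append(([rng.gauss(cx, 0.5), rng.gauss(cy, 0.5)], label))

refs = train(samples, [[1.0, 1.0], [3.0, 3.0]], ref_class=[0, 1])
print(refs)
```

Passing `supervised=False` drops the punishment branch, which corresponds to the unsupervised AVQ variant described above.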

The differential competitive law (11) can be viewed as a local
unsupervised approximation of Kohonen's supervised AVQ algorithm.
Indeed preliminary simulations of (11) in stochastic feedforward mode
show similar classification performance in many noise environments.

The pulse-coded differential competitive law (12), as discussed above,
can be expected to often behave as the competitive law (3) with 0-1
threshold signal function $S_j$. This is precisely when the competitive
law has been shown [32] globally stable when embedded in the nonlinear
dynamical systems below. For this reason, we here limit the stability
analysis of the differential competitive law to that of the competitive
law with steep signal function $S_j$. We similarly limit the stability
analysis of the differential Hebb law (4) to the analysis of the signal
Hebb law, even though differential Hebb dynamical systems are known
[32], [33] to be globally stable in the special case that signal
velocities are comparable to signal accelerations.


III. Unidirectional and Bidirectional Nonlinear
Dynamical Systems

We study nonlinear dynamical systems described by Cohen-Grossberg [6],
[14] dynamics. In the unidirectional or autoassociative case, when
$F_X = F_Y$ and $M = M^T$, a neural network possesses Cohen-Grossberg
dynamics if its activation equations can be written in the abstract
form

$$\dot{x}_i = -a_i(x_i)\Big[b_i(x_i) - \sum_j S_j(x_j)\, m_{ij}\Big] \tag{17}$$

where $a_i(x_i) \geq 0$ is an amplification function, $b_i$ is
arbitrary so long as it keeps the integrals bounded in the Lyapunov
functions below, and $S_i$ is a bounded monotone nondecreasing
($S_i' \geq 0$) signal function. The global stability of nonlearning
autoassociative systems described by (17) is ensured by the
Cohen-Grossberg theorem [6], which is abstractly equivalent, in the
sense that $R^n \times R^p = R^{n+p}$, to the BAM theorem below for
nonlearning heteroassociative networks and a special case of the ABAM
theorem reviewed in the next section.

Perhaps the most important special cases of (17) are additive and
shunting networks, the popular versions of which are the respective
Hopfield circuit [19] and the Hodgkin-Huxley membrane equation [18].
Grossberg [14] has also shown that (17) reduces to the additive
brain-state-in-a-box model of Anderson [1], [2] and the shunting
masking field model [7] upon appropriate change of variables. An
autoassociative system has additive activation dynamics if the
amplification function $a_i$ is constant and the $b_i$ function is
linear. For instance, if $a_i = 1/C_i$, $b_i = (x_i/R_i) - I_i$,
$S_j(x_j) = g_j(x_j) = V_j$, and constant
$m_{ij} = m_{ji} = T_{ij} = T_{ji}$, where $C_i$ and $R_i$ are positive
constants and input $I_i$ is constant or slowly varying relative to
fluctuations in $x_i$, then (17) reduces to the Hopfield circuit [19]:

$$C_i \dot{x}_i = -\frac{x_i}{R_i} + \sum_j V_j T_{ij} + I_i \tag{18}$$

Grossberg [13] has shown that neurons with additive dynamics saturate
at their upper bounds (if they have them) when inputs are arbitrarily
large, thus ignoring the relative pattern information in the input
pattern $(I_1, \ldots, I_n)$.

An autoassociative network has shunting or multiplicative activation
dynamics when the amplification function $a_i$ is linear and $b_i$ is
nonlinear. For instance, the choice $a_i = -x_i$, $m_{ii} = 1$
(self-excitation in lateral inhibition), and
$b_i = (1/x_i)\big[-A_i x_i + B_i(S_i + I_i^+) - x_i(S_i + I_i^+) - C_i\big(\sum_{j \neq i} S_j m_{ij} + I_i^-\big)\big]$
gives the distance-dependent ($m_{ij} = m_{ji}$) unidirectional
shunting network:

$$\dot{x}_i = -A_i x_i + (B_i - x_i)\big[S_i(x_i) + I_i^+\big] - (C_i + x_i)\Big[\sum_{j \neq i} S_j(x_j)\, m_{ij} + I_i^-\Big] \tag{19}$$

where $A_i$ is a positive decay constant and $B_i$ and $C_i$ are
positive saturation constants. The first term on the right-hand side of
(19) is a passive decay term. The second and third terms are,
respectively, positive and negative feedback terms. (Strictly speaking,
$a_i(x_i)$ must be kept positive. $x_i$ can always be translated to
achieve this.) If the shunting $x_i$ terms in the positive and negative
feedback terms are scaled to zero, (19) reduces to an additive model.
Grossberg also showed that shunting models do not saturate when
presented with arbitrarily large positive inputs. They remain sensitive
to the relative pattern information in $(I_1, \ldots, I_n)$. Perhaps
more important for neurobiologists, Grossberg [13], [14] observed that
the shunting model (19) is naturally generalized by the celebrated
Hodgkin-Huxley membrane equation:

$$c\,\frac{\partial V_i}{\partial t} = (V^p - V_i)\, g_i^p + (V^+ - V_i)\, g_i^+ + (V^- - V_i)\, g_i^- \tag{20}$$

where $V^p$, $V^+$, and $V^-$ are respective passive, excitatory
(sodium Na$^+$), and inhibitory (potassium K$^+$) saturation upper
bounds with corresponding shunting conductances $g_i^p$, $g_i^+$, and
$g_i^-$, and where the constant capacitance $c > 0$ scales time. The
shunting model (19) becomes the membrane equation (20) if $V_i = x_i$,
$V^p = 0$, $V^+ = B_i$, $V^- = -C_i$, $g_i^p = A_i$,
$g_i^+ = S_i(x_i) + I_i^+$, and
$g_i^- = \sum_{j \neq i} S_j m_{ij} + I_i^-$.

Continuous bidirectional associative memories [28]-[32] (BAM's) arise
when two (or more) neural fields $F_X$ and $F_Y$ are connected in the
forward direction, from $F_X$ to $F_Y$, by an arbitrary $n$-by-$p$
synaptic matrix $M$ and connected in the backward direction, from $F_Y$
to $F_X$, by the $p$-by-$n$ matrix $N = M^T$, where $M^T$ is the
transpose of $M$. BAM activations also possess Cohen-Grossberg
dynamics, and their extensions:

$$\dot{x}_i = -a_i(x_i)\Big[b_i(x_i) - \sum_j S_j(y_j)\, m_{ij}\Big] \tag{21}$$

$$\dot{y}_j = -a_j(y_j)\Big[b_j(y_j) - \sum_i S_i(x_i)\, m_{ij}\Big] \tag{22}$$

with corresponding Lyapunov function $L$:

$$L = -\sum_i \sum_j S_i S_j m_{ij} + \sum_i \int_0^{x_i} S_i'(\theta_i)\, b_i(\theta_i)\, d\theta_i + \sum_j \int_0^{y_j} S_j'(\epsilon_j)\, b_j(\epsilon_j)\, d\epsilon_j$$

where the functions $b_i$ and $b_j$ must be suitably constrained to
keep $L$ bounded.

The quadratic form in $L$ is bounded because the signal functions $S_i$
and $S_j$ are bounded. Boundedness of the integral terms requires
additional technical hypotheses to avoid pathologies as discussed by
Cohen and Grossberg [6]. For our purpose we simply assume the integral
terms are bounded.

All BAM results extend to any number of BAM-connected fields. Complex
topologies are possible and, in theory, will equilibrate as rapidly as
the two-layer BAM system. The back-and-forth flow of information in a
BAM facilitates natural large-scale optical implementations [20], [28].

The BAM model (21), (22) clearly reduces to the Cohen-Grossberg model
if both neural fields collapse into one, $F_X = F_Y$, and the constant
matrix $M$ is symmetric ($M = M^T$). Conversely, the BAM system, which
is always globally stable, can be abstractly viewed [30] as
symmetrizing an arbitrary matrix $M$. For if the two BAM fields are
abstractly concatenated into a new field $F_Z$, $F_Z = F_X \cup F_Y$,
with zero block diagonal synaptic matrix $W$ that contains $M$ and
$M^T$ as respective upper and lower blocks, then the BAM dynamical
system (21), (22) is equivalent to the autoassociative system (17).
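The symmetrizing construction is easy to check by hand. As a small illustration (the function names and the example matrix are hypothetical), the sketch below builds $W$ from an arbitrary rectangular $M$ and confirms that $W$ is symmetric with zero diagonal blocks.

```python
def symmetrize(M):
    """Build W = [[0, M], [M^T, 0]] from an n-by-p matrix M (list of rows)."""
    n, p = len(M), len(M[0])
    W = [[0.0] * (n + p) for _ in range(n + p)]
    for i in range(n):
        for j in range(p):
            W[i][n + j] = M[i][j]   # upper-right block holds M
            W[n + j][i] = M[i][j]   # lower-left block holds M^T
    return W


def is_symmetric(W):
    """True when W equals its transpose."""
    size = len(W)
    return all(W[r][c] == W[c][r] for r in range(size) for c in range(size))


M = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]               # arbitrary 2-by-3 matrix, not symmetric
W = symmetrize(M)
for row in W:
    print(row)
print(is_symmetric(W))
```

Any $M$, square or not, yields a symmetric $W$ on the concatenated field, which is why the autoassociative theory applies to the two-field system.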

The BAM system (21) includes additive and shunting models. If
$a_i = 1 = a_j$, $b_i = x_i - I_i$, and $b_j = y_j - J_j$ for
relatively constant inputs $I_i$ and $J_j$, then an additive BAM [30],
[31] results:

$$\dot{x}_i = -x_i + \sum_j S_j(y_j)\, m_{ij} + I_i \tag{23}$$

$$\dot{y}_j = -y_j + \sum_i S_i(x_i)\, m_{ij} + J_j \tag{24}$$

where again constants can be added or multiplied as desired. More
generally, if $a_i = -x_i$, $a_j = -y_j$,
$b_i = (1/x_i)\big[-x_i + (B_i - x_i)[S_i(x_i) + I_i^+] - x_i I_i^-\big]$,
and
$b_j = (1/y_j)\big[-y_j + (B_j - y_j)[S_j(y_j) + J_j^+] - y_j J_j^-\big]$,
then a shunting BAM [30] results:

$$\dot{x}_i = -x_i + (B_i - x_i)\big[S_i + I_i^+\big] - x_i\Big[\sum_j S_j m_{ij} + I_i^-\Big] \tag{25}$$

$$\dot{y}_j = -y_j + (B_j - y_j)\big[S_j + J_j^+\big] - y_j\Big[\sum_i S_i m_{ij} + J_j^-\Big] \tag{26}$$

The shunting BAM (25), (26) reminds us that in general
distance-dependent competition occurs within fields $F_X$ and $F_Y$.
Suppose the $n$-by-$n$ matrix $R$ and the $p$-by-$p$ matrix $S$
describe the distance-dependent ($R = R^T$, $S = S^T$) lateral
inhibition within $F_X$ and $F_Y$, respectively. Then the general BAM
model (21), (22) must be augmented to a competitive BAM [29]:

$$\dot{x}_i = -a_i(x_i)\Big[b_i(x_i) - \sum_j S_j(y_j)\, m_{ij} - \sum_k S_k(x_k)\, r_{ki}\Big] \tag{27}$$

$$\dot{y}_j = -a_j(y_j)\Big[b_j(y_j) - \sum_i S_i(x_i)\, m_{ij} - \sum_l S_l(y_l)\, s_{lj}\Big] \tag{28}$$

An adaptive bidirectional associative memory (ABAM) is a globally
stable dynamical system with activation dynamics described by (21),
(22) or (27), (28) and synaptic dynamics described by a first-order
learning law. The original ABAM [30] restricted the choice of learning
law to the signal Hebb law (1). Signal Hebb ABAM's are unconditionally
globally stable, though limited in their ability to estimate continuous
functions. Better, though more costly, estimation can be obtained with
higher order signal Hebb ABAM's. For example, in autoassociative
notation, the second-order signal Hebb ABAM [32] is described by
(29)-(31):

$$\dot{x}_i = -a_i(x_i)\Big[b_i(x_i) - \sum_j S_j(x_j)\, m_{ij} - \sum_j \sum_k S_j(x_j)\, S_k(x_k)\, n_{ijk}\Big] \tag{29}$$

$$\dot{m}_{ij} = -m_{ij} + S_i(x_i)\, S_j(x_j) \tag{30}$$

$$\dot{n}_{ijk} = -n_{ijk} + S_i(x_i)\, S_j(x_j)\, S_k(x_k) \tag{31}$$

with corresponding Lyapunov function $L$:

$$L = -\frac{1}{2}\sum_i \sum_j S_i S_j m_{ij} - \frac{1}{3}\sum_i \sum_j \sum_k S_i S_j S_k n_{ijk} + \sum_i \int_0^{x_i} S_i'(\theta_i)\, b_i(\theta_i)\, d\theta_i + \frac{1}{4}\sum_i \sum_j m_{ij}^2 + \frac{1}{6}\sum_i \sum_j \sum_k n_{ijk}^2. \tag{32}$$

The Lyapunov function remains bounded in the adaptive case. The new
terms

$$\frac{1}{4}\sum_i \sum_j m_{ij}^2 \quad \text{and} \quad \frac{1}{6}\sum_i \sum_j \sum_k n_{ijk}^2 \tag{33}$$

in (32) are bounded because the solutions to (30) and (31) are bounded
since, ultimately, the signal functions $S_i$ are bounded.

If $a_i(x_i) > 0$ and $S_j' > 0$, and if (32) is differentiated with
respect to time, rearranged, and (29)-(31) are used to eliminate
terms, then $L$ strictly decreases along trajectories, yielding
asymptotic stability (and in general exponential convergence), since

$$\dot{L} = -\sum_i \frac{S_i'}{a_i(x_i)}\, \dot{x}_i^2 - \frac{1}{2}\sum_i \sum_j \dot{m}_{ij}^2 - \frac{1}{3}\sum_i \sum_j \sum_k \dot{n}_{ijk}^2 < 0 \tag{34}$$

if any activation or synaptic velocity is nonzero. The strict
monotonicity assumption $S_i' > 0$ and (33) further imply that
$\dot{L} = 0$ if and only if all parameters stop changing:
$\dot{x}_i = \dot{m}_{ij} = \dot{n}_{ijk} = 0$ for all $i$, $j$, $k$.
All like higher order ABAM's are globally stable.

The restriction to signal Hebbian learning was relaxed [32] to allow
competitive learning with (3) provided $S_j$ is steep, and further
relaxed to allow differential Hebbian learning with (4) provided signal
velocities and signal accelerations agree in sign. A competitive ABAM
(CABAM) results from (27), (28) if learning is governed by the
competitive learning law (3) and if $S_j$ behaves essentially as a 0-1
step function. For then, upon time differentiation, the appropriate
Lyapunov function $L$ takes the form

$$\dot{L} = -\sum_i \frac{S_i'}{a_i}\, \dot{x}_i^2 - \sum_j \frac{S_j'}{a_j}\, \dot{y}_j^2 - \sum_i \sum_j \dot{m}_{ij}\big[S_i(x_i)\, S_j(y_j) - m_{ij}\big]. \tag{35}$$

The trick is to eliminate $\dot{m}_{ij}$ in (35) with the competitive
law (3) and exploit the 0-1 threshold (steep-sigmoid) behavior of
$S_j$. Then the relevant product becomes nonnegative:

$$\dot{m}_{ij}[S_i S_j - m_{ij}] = S_j (S_i - m_{ij})[S_i S_j - m_{ij}] = \begin{cases} 0, & S_j(y_j) = 0 \\ (S_i - m_{ij})^2, & S_j(y_j) = 1. \end{cases}$$

Thus both winners and losers in $F_Y$ keep $L$ decreasing and ensure
that every CABAM is globally stable.

CABAM's are topologically equivalent to adaptive resonance theory (ART)
systems [13]. The idea behind ART systems is learn only if resonate.
Resonance, though, is simply joint stability at $F_X$ and $F_Y$
mediated by the forward connections $M$ and the backward connections
$N$. When $N = M^T$ and activation dynamics are described by (27),
(28), ART models become CABAM models so long as learning is described
by a globally stable learning law, in particular the competitive law
(3) with steep signal function $S_j$. This is the case with the recent
ART-2 model [5] since the activation (short-term memory) dynamics of
$F_X$ and $F_Y$ are described by shunting equations and, in the
notation of Carpenter and Grossberg, the learning (long-term memory)
dynamics are described by CABAM-style competitive learning laws with
threshold signal functions in $F_Y$:

$$\text{top-down } (F_Y \to F_X): \quad \dot{z}_{ji} = g(y_j)[p_i - z_{ji}] \tag{36}$$

$$\text{bottom-up } (F_X \to F_Y): \quad \dot{z}_{ij} = g(y_j)[p_i - z_{ij}] \tag{37}$$

where $g$ is a threshold signal function and $p_i$ is the signal
pattern (itself involving complicated $L^2$-norm computations)
transmitted from $F_X$. Equation (36) says matrix $Z$ contains forward
projections and its transpose $Z^T$ contains backward connections.

In contrast, the earlier binary ART-1 model [4] is not extended by the
CABAM model because Weber law structure is imposed on the forward
"bottom-up" synaptic projections, and thus the forward and backward
connection matrices are not related by transposition. This in part
explains why binary inputs in ART-2 need not produce ART-1 behavior. It
also suggests that the ART-2 model can in principle be similarly
modified by adding Weber law structure to (36), producing an ART-2'
model that is not a CABAM.

These connections among unsupervised feedback dynamical systems are
summarized by the taxonomy in Fig. 2 of artificial neural networks
(ANN's) and placed in context with unsupervised feedforward adaptive
vector quantizers and the extremely popular supervised feedforward
gradient-descent networks.

Fig. 2.

The more general RABAM model is developed below.

Finally, for completeness, we state the form of ABAM systems that adapt
(and activate) with signal velocity information by using the
differential Hebb learning law [33]:

$$\dot{x}_i = -a_i(x_i)\Big[b_i(x_i) - \sum_j S_j m_{ij} - \sum_j \dot{S}_j \dot{m}_{ij}\Big] \tag{38}$$

$$\dot{y}_j = -a_j(y_j)\Big[b_j(y_j) - \sum_i S_i m_{ij} - \sum_i \dot{S}_i \dot{m}_{ij}\Big] \tag{39}$$

$$\dot{m}_{ij} = -m_{ij} + S_i S_j + \dot{S}_i \dot{S}_j \tag{40}$$

and the further assumptions $\dot{S}_i \approx \ddot{S}_i$,
$\dot{S}_j \approx \ddot{S}_j$, where in general (40) can be loosened
to only require that signal velocities and accelerations tend to have
the same sign (as in clipped exponentials). The corresponding Lyapunov
function now includes a "kinetic energy" term to account for signal
velocities:

$$L = -\sum_i \sum_j S_i S_j m_{ij} - \sum_i \sum_j \dot{S}_i \dot{S}_j m_{ij} + \sum_i \int_0^{x_i} S_i'(\theta_i)\, b_i(\theta_i)\, d\theta_i + \sum_j \int_0^{y_j} S_j'(\epsilon_j)\, b_j(\epsilon_j)\, d\epsilon_j + \frac{1}{2}\sum_i \sum_j m_{ij}^2.$$


IV.  Stability-Convergence  Dilemma  and  the 
ABAM  Theorem 

Stability and convergence are equilibrium properties. Stability is
equilibrium in a neuronal field: $(d/dt)F_X = 0$. Convergence is
equilibrium in a synaptic web: $(d/dt)M = 0$. Global stability is joint
stability and convergence for all inputs and all network parameters.
Pattern formation occurs across field $F_X$ when it stabilizes. The
stable signals across $F_X$ make up the formed pattern. Stability is
trivial in a feedforward network.

Global stability is difficult to achieve in unsupervised feedback
networks. After all, most feedback systems are unstable. Global
stability requires a delicate dynamical balance between stability and
convergence. Achieving such a balance is arguably the central problem
in analyzing, and building, unsupervised feedback dynamical systems.
The chief difficulty stems from the dynamical asymmetry between neural
and synaptic fluctuations. Neurons fluctuate orders of magnitude faster
than synapses: learning is slow. In real neural systems, neuronal
fluctuation may be at the millisecond level, while synaptic fluctuation
may be at the second or even minute level.

The stability-convergence dilemma arises from the asymmetry in neuronal
and synaptic fluctuation rates. The dilemma unfolds as follows. Neurons
change faster than synapses change. Patterns form when neurons
stabilize, when $(d/dt)F_X = 0$ and $(d/dt)F_Y = 0$. The slowly varying
synapses $M$ try to learn these patterns. Since the neurons are stable
for more than a synaptic moment, the synapses begin to adapt to the
neuronal patterns: learning begins. So $(d/dt)F_X = 0$ and
$(d/dt)F_Y = 0$ imply $(d/dt)M \neq 0$. Since there are numerous
feedback paths from the synapses to the neurons, the neurons tend to
change state. So $(d/dt)M \neq 0$ implies $(d/dt)F_X \neq 0$ and
$(d/dt)F_Y \neq 0$. Learning tends to undo the very stability patterns
to be encoded, and hence the dilemma. In summary, for two fields of
neurons $F_X$ and $F_Y$ connected in the forward direction by $M$ and
in the backward direction by $M^T$, the stability-convergence dilemma
has four parts, described as follows.

A. Stability-Convergence Dilemma

1) Asymmetry: Neurons in $F_X$ and $F_Y$ fluctuate faster than the
synapses $M$.

2) Stability: $\frac{d}{dt}F_X = 0$ and $\frac{d}{dt}F_Y = 0$ (pattern
formation).

3) Learning: $\frac{d}{dt}F_X = 0$ and $\frac{d}{dt}F_Y = 0 \;\rightarrow\; \frac{d}{dt}M \neq 0$.

4) Undoing: $\frac{d}{dt}M \neq 0 \;\rightarrow\; \frac{d}{dt}F_X \neq 0$ and $\frac{d}{dt}F_Y \neq 0$.

The ABAM theorem [32] provides one resolution of the
stability-convergence dilemma. The adaptive resonance concept provides
another, though, as discussed in the previous section, the recent ART-2
instantiation of the concept is a CABAM. The ABAM theorem ensures the
global stability, the joint stability and convergence, of dynamical
systems with activation dynamics described by (21) and (22) and that
learn according to the signal Hebb learning law (1). The extensions to
competitive and differential Hebbian learning (and thus differential
competitive learning) discussed above all require more assumptions than
learning with the signal Hebb law, which requires none. Since the ABAM
theorem is the starting point for the random-process extension to the
RABAM theorem below, we review its statement and proof.


B. ABAM Theorem

Every signal Hebb BAM is asymptotically stable, where the network
dynamics are described by

$$\dot{x}_i = -a_i(x_i)\Big[b_i(x_i) - \sum_j S_j(y_j)\, m_{ij}\Big] \tag{41}$$

$$\dot{y}_j = -a_j(y_j)\Big[b_j(y_j) - \sum_i S_i(x_i)\, m_{ij}\Big] \tag{42}$$

$$\dot{m}_{ij} = -m_{ij} + S_i(x_i)\, S_j(y_j) \tag{43}$$

and $a_i > 0$ and $a_j > 0$, and $S_i$ and $S_j$ are bounded monotone
increasing ($S_i' > 0$ and $S_j' > 0$) signal functions. At
equilibrium, all activation and synaptic velocities are zero.

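Proof. Consider the global Lyapunov function $L$:

$$L = -\sum_i \sum_j S_i S_j m_{ij} + \sum_i \int_0^{x_i} S_i'(\theta_i)\, b_i(\theta_i)\, d\theta_i + \sum_j \int_0^{y_j} S_j'(\epsilon_j)\, b_j(\epsilon_j)\, d\epsilon_j + \frac{1}{2}\sum_i \sum_j m_{ij}^2. \tag{44}$$

Then time differentiation and collection of like terms gives

$$\dot{L} = \sum_i S_i' \dot{x}_i \Big\{ b_i - \sum_j S_j m_{ij} \Big\} + \sum_j S_j' \dot{y}_j \Big\{ b_j - \sum_i S_i m_{ij} \Big\} - \sum_i \sum_j \dot{m}_{ij}\big\{ S_i S_j - m_{ij} \big\}. \tag{45}$$

Then, using the positivity of $a_i$ and $a_j$, the terms in braces can
be eliminated with the respective equations (41)-(43). This proves that
$L$ is strictly decreasing along trajectories:

$$\dot{L} = -\sum_i \frac{S_i'}{a_i}\, \dot{x}_i^2 - \sum_j \frac{S_j'}{a_j}\, \dot{y}_j^2 - \sum_i \sum_j \dot{m}_{ij}^2 < 0 \tag{46}$$

for any activation or synaptic change. Since $S_i' > 0$ and
$S_j' > 0$, $\dot{L} = 0$ if and only if
$\dot{x}_i = \dot{y}_j = \dot{m}_{ij} = 0$ for all $i$ and $j$. Q.E.D.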

The strict inequality sign in (46) yields asymptotic stability, which
ensures that trajectories end in equilibrium points, not merely near
them. Asymptotic stability also ensures that the eigenvalues of the
Jacobian matrix of the system (41)-(43) have nonpositive real parts
near equilibria. A nondegenerate Hessian further ensures that the real
parts of the eigenvalues are negative. Then [16] the nonlinear system
(41)-(43) converges exponentially quickly as if it were linear.


V.  Random  Adaptive  Bidirectional  Associative 
Memories 

Random adaptive bidirectional associative memory (RABAM) models are
everywhere perturbed by Brownian diffusions. The differential equations
in (41)-(43) now become stochastic differential equations, with random
processes as solutions. In the simplest case, Brownian diffusions are
simply added to deterministic differential equations. In the more
general case adopted here, every activation and synaptic variable
represents a separate stochastic process. The stochastic differential
equations relate the time evolution of these stochastic processes.
Brownian diffusions, or "noise" processes, are then added to the
stochastic differential equations. In principle this Ito calculus
approach need not preserve the chain rule of deterministic differential
calculus. The final section, though, discusses why for RABAM models the
classical chain-rule relationships still hold.

Let $B_i$, $B_j$, and $B_{ij}$ be Brownian motion (independent Gaussian
increment) processes [35], [41] perturbing the $i$th neuron in $F_X$,
the $j$th neuron in $F_Y$, and the synapse $m_{ij}$, respectively. The
Brownian motions are allowed to have time-varying diffusion parameters.
Then the diffusion RABAM is described by (47)-(49):

$$dx_i = -a_i(x_i)\Big[b_i(x_i) - \sum_j S_j(y_j)\, m_{ij}\Big]\, dt + dB_i \tag{47}$$

$$dy_j = -a_j(y_j)\Big[b_j(y_j) - \sum_i S_i(x_i)\, m_{ij}\Big]\, dt + dB_j \tag{48}$$

$$dm_{ij} = -m_{ij}\, dt + S_i(x_i)\, S_j(y_j)\, dt + dB_{ij} \tag{49}$$

The signal Hebb diffusion law (49) can be replaced with the competitive
diffusion law

$$dm_{ij} = S_j(y_j)[S_i - m_{ij}]\, dt + dB_{ij} \tag{50}$$

if $S_j$ is sufficiently steep. Or it can be replaced with differential
Hebb or differential competitive diffusion laws if tighter constraints
are imposed. For simplicity, we shall formulate the RABAM model in the
signal Hebb case only. The extensions to competitive and differential
learning proceed exactly as the above extensions of the ABAM theorem.
All RABAM results, like all ABAM results, also immediately extend to
higher order systems of arbitrarily high order.
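A diffusion RABAM of the form (47)-(49) can be simulated with the Euler-Maruyama scheme, in which each Brownian increment over a step `dt` is a zero-mean Gaussian with standard deviation proportional to $\sqrt{dt}$. The sketch below is a hedged illustration only: it assumes the additive case with zero inputs ($a = 1$, $b_i = x_i$, $b_j = y_j$), a logistic signal function, and one constant diffusion parameter `sigma` shared by all processes, whereas the text allows time-varying parameters.

```python
import math
import random


def S(u):
    """Logistic signal function."""
    return 1.0 / (1.0 + math.exp(-u))


def rabam_path(x, y, M, sigma, dt, steps, rng):
    """Euler-Maruyama integration of (47)-(49) in the additive case."""
    x, y = list(x), list(y)
    M = [list(row) for row in M]
    n, p = len(x), len(y)
    root_dt = math.sqrt(dt)

    def dB():
        # Brownian increment: zero mean, variance sigma**2 * dt
        return sigma * root_dt * rng.gauss(0.0, 1.0)

    for _ in range(steps):
        sx = [S(v) for v in x]
        sy = [S(v) for v in y]
        new_x = [x[i] + (-x[i] + sum(sy[j] * M[i][j] for j in range(p))) * dt
                 + dB() for i in range(n)]
        new_y = [y[j] + (-y[j] + sum(sx[i] * M[i][j] for i in range(n))) * dt
                 + dB() for j in range(p)]
        M = [[M[i][j] + (-M[i][j] + sx[i] * sy[j]) * dt + dB()
              for j in range(p)] for i in range(n)]
        x, y = new_x, new_y
    return x, y, M


x0, y0 = [1.0, -1.0], [0.5, 0.0, -0.5]
M0 = [[0.2, -0.4, 0.1], [0.0, 0.3, -0.2]]
quiet = rabam_path(x0, y0, M0, 0.0, 0.01, 500, random.Random(1))
noisy = rabam_path(x0, y0, M0, 0.1, 0.01, 2000, random.Random(3))
print(quiet[2])
print(noisy[2])
```

Setting `sigma` to zero recovers the deterministic ABAM trajectory, which gives a baseline for comparing noisy runs.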

The RABAM model can be restated in more familiar, less rigorous, "noise
notation." Intuitively independent zero-mean noise is added to the ABAM
model. The stochastic differential equations then describe the time
evolution of network "signals plus noise." This implicitly means that
the noise processes are independent of the nonlinear "signal"
processes. For emphasis, though, we explicitly make the weaker
assumption that the noise processes are uncorrelated with the "signal"
processes. We further assume that the noise processes have finite
variances, though they may be time varying. Then the noise RABAM model
is described by the stochastic differential equations

$$\dot{x}_i = -a_i(x_i)\Big[b_i(x_i) - \sum_j S_j(y_j)\, m_{ij}\Big] + n_i \tag{51}$$

$$\dot{y}_j = -a_j(y_j)\Big[b_j(y_j) - \sum_i S_i(x_i)\, m_{ij}\Big] + n_j \tag{52}$$

$$\dot{m}_{ij} = -m_{ij} + S_i(x_i)\, S_j(y_j) + n_{ij} \tag{53}$$

$$E(n_i) = E(n_j) = E(n_{ij}) = 0 \tag{54}$$

perhaps reflecting random input signals. A separate analysis [34] shows
that additive input noise can be accommodated for additive and shunting
activation models. For additive activation models, such additive
activation noise can be included in the noise terms $n_i$ and $n_j$.

Will so much noise destabilize the system? So much noise with so much
feedback would seem to promote chaos, especially since the network
dimensions $n$ and $p$ can be arbitrarily large. How can stable
learning occur?

The RABAM theorem ensures stochastic stability. Nonlinear interactions
suppress noise and suppress it exponentially quickly. In effect, RABAM
equilibria are ABAM equilibria that randomly vibrate. The diffusion
parameters, or the noise variances, control the range of vibration.
Average RABAM behavior is just ABAM behavior. Since noise perturbations
do not destroy equilibria, the RABAM theorem says that unsupervised
learning is structurally stable in the stochastic sense. The result
applies with equal force, though with less theoretical interest, for
unsupervised learning in feedforward networks.

The RABAM theorem can be motivated with a simple thought experiment or,
better, a few hand calculations. Consider a discrete additive BAM with
fixed matrix $M$. Find its bipolar fixed points in the product space
$\{-1, 1\}^n \times \{-1, 1\}^p$. Now add a small amount of zero-mean
noise to each memory element $m_{ij}$. Since a discrete BAM signal
function is a threshold function, it is unlikely that more than very
few neurons, if any, change state differently during iterations than
they did before. It is even less likely that they will do so as $n$ and
$p$ increase. The same fixed points tend to be reached, and tend to
persist once reached. This corresponds to adding noise at the synaptic
level. Now repeat the computation, but also add zero-mean noise to each
neuron's activation at each iteration. Then repeat this computation,
adding new noise to the matrix $M$ each time. This allows the synaptic
noise processes to be "slower" than the neuronal noise processes. Again
the threshold signal functions make it unlikely that the signal
patterns will change significantly, if at all, during iterations or in
equilibrium.
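The hand calculation described above can be scripted. The sketch below makes several assumptions that are not in the text: $M$ is the outer product of one stored bipolar pair, the noise is uniform on a bounded interval (zero-mean, and small enough here that it cannot flip any threshold), and fresh synaptic and activation noise is drawn on every sweep for brevity.

```python
import random


def threshold(v, previous):
    """Bipolar threshold signal; keep the previous state on an exact tie."""
    if v > 0:
        return 1
    if v < 0:
        return -1
    return previous


def bam_recall(M, x, noise=0.0, rng=None, sweeps=5):
    """Iterate a discrete additive BAM with optional zero-mean noise."""
    n, p = len(M), len(M[0])
    rng = rng or random.Random(0)
    y = [1] * p
    for _ in range(sweeps):
        # perturb every synapse, then every activation, on each sweep
        N = [[M[i][j] + rng.uniform(-noise, noise) for j in range(p)]
             for i in range(n)]
        y = [threshold(sum(x[i] * N[i][j] for i in range(n))
                       + rng.uniform(-noise, noise), y[j]) for j in range(p)]
        x = [threshold(sum(y[j] * N[i][j] for j in range(p))
                       + rng.uniform(-noise, noise), x[i]) for i in range(n)]
    return x, y


a, b = [1, -1, 1, -1], [1, 1, -1]          # stored bipolar pair
M = [[ai * bj for bj in b] for ai in a]    # outer-product memory matrix
clean = bam_recall(M, list(a))
noisy = bam_recall(M, list(a), noise=0.5, rng=random.Random(7))
print(clean)
print(noisy)
```

For this matrix each noise-free net input has magnitude 4 (into $F_Y$) or 3 (into $F_X$), while the combined perturbation is at most 2.5 or 2, so the same fixed point is reached with or without noise.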

A.  RABAM  Theorem 

The RABAM model (47)-(50), or (51)-(55), is globally stable. If signal
functions are strictly increasing and amplification functions $a_i$ and
$a_j$ are strictly positive, the RABAM model is asymptotically stable.

Proof. The ABAM Lyapunov function (44) is now a random process. At each
time $t$, $L(t)$ is a random variable. We conjecture that the expected
ABAM Lyapunov function $E(L)$ is a Lyapunov function for the RABAM
system, where the expectation is with respect to all random parameters:

$$E(L) = \int \cdots \int L\, p(X, Y, M)\, dX\, dY\, dM. \tag{56}$$

(Recall that each activation and synaptic parameter represents a random
process separate from the random process obtained simply by adding
noise to a deterministic variable.)

The proof strategy is to replace the time derivative of the expectation
with the expectation of the time derivative of the ABAM Lyapunov
function, which we calculated above. Technically we need to assume
sufficient smoothness conditions on the RABAM model to bring the time
derivative inside the multiple integrals in (56). This assumption adds
little burden. Then

$$\dot{E}(L) = E(\dot{L})$$

and by (45)

$$= E\Big\{ \sum_i S_i' \dot{x}_i \Big[b_i - \sum_j S_j m_{ij}\Big] + \sum_j S_j' \dot{y}_j \Big[b_j - \sum_i S_i m_{ij}\Big] - \sum_i \sum_j \dot{m}_{ij}\big[-m_{ij} + S_i S_j\big] \Big\}$$

$$= E\Big\{ -\sum_i S_i' a_i \Big[b_i - \sum_j S_j m_{ij}\Big]^2 - \sum_j S_j' a_j \Big[b_j - \sum_i S_i m_{ij}\Big]^2 - \sum_i \sum_j \big[-m_{ij} + S_i S_j\big]^2 \Big\}$$
$$\quad + \sum_i E\Big\{ n_i S_i' \Big[b_i - \sum_j S_j m_{ij}\Big] \Big\} + \sum_j E\Big\{ n_j S_j' \Big[b_j - \sum_i S_i m_{ij}\Big] \Big\} - \sum_i \sum_j E\big\{ n_{ij}\big[-m_{ij} + S_i S_j\big] \big\} \tag{57}$$

upon eliminating the activation and synaptic velocities in (57) with
the RABAM dynamical equations (51)-(53)

$$= E[\dot{L}_{\text{ABAM}}] + \sum_i E(n_i)\, E\Big\{ S_i' \Big[b_i - \sum_j S_j m_{ij}\Big] \Big\} + \sum_j E(n_j)\, E\Big\{ S_j' \Big[b_j - \sum_i S_i m_{ij}\Big] \Big\} - \sum_i \sum_j E(n_{ij})\, E\big[-m_{ij} + S_i S_j\big] \tag{58}$$

by the uncorrelatedness (independence) of the "signal" and additive
noise terms in the RABAM model, and by the facts that $S_i'$ and $S_j'$
are nonnegative functions of $x_i$ and $y_j$ respectively, and $a_i$
and $a_j$ are nonnegative essentially arbitrary functions (so
$S_i' = a_i$ and $S_j' = a_j$ possible)

$$= E[\dot{L}_{\text{ABAM}}]$$

behavior depends fundamentally on the variances of the additive noise
processes. Observe that the zero-mean assumption (54) implies that the
time-varying "variances" $\sigma_i^2$, $\sigma_j^2$, and
$\sigma_{ij}^2$ are the respective instantaneous mean-squared "noises"
$E(n_i^2)$, $E(n_j^2)$, and $E(n_{ij}^2)$, since in general
$V(x) = E(x^2) - E^2(x)$.

Observed RABAM second-order behavior consists of the observed
instantaneous mean-squared velocities $E(\dot{x}_i^2)$,
$E(\dot{y}_j^2)$, and $E(\dot{m}_{ij}^2)$. The mean-squared velocities
measure the magnitude of instantaneous RABAM change. They are at least
as large as the underlying instantaneous "variances" of the activation
velocity and synaptic velocity processes, since, for example

$$E(\dot{x}_i^2) \geq E(\dot{x}_i^2) - E^2(\dot{x}_i) = V(\dot{x}_i). \tag{59}$$

Intuitively  the  mean-squared  velocities  should  depend 
on  the  instantaneous  “variances”  of  the  noise  processes 
in  (5 1 )— (53).  The  more  the  noise  processes  hop  about  their 
means,  the  greater  the  potential  for  the  activations  and 
synapses  to  change  state.  But  this  intuition  seems  to  run 
counter  to  the  structural  stability  established  by  the  RA¬ 
BAM  theorem.  Surely,  it  seems,  if  the  magnitudes  of  the 
noise  fluctuations  grow  arbitrarily  large,  there  comes  a 
point— and  perhaps  a  point  quickly  reached  in  the  midst 
of  massive  noisy  feedback— where  the  RABAM  system 
transitions  from  stability  to  instability. 

The  RABAM  noise  suppression  theorem  guarantees  that 
no  noise  processes  can  destabilize  a  RABAM  if  the  noise 
processes  have  finite  instantaneous  variances.  (Cauchy 
noise,  for  example,  in  theory  could  destabilize  a  RABAM 
since  it  has  infinite  variance.  In  practice,  though,  even 
Cauchy  variance  is  finite,  and  so  it  will  never  destabilize 
a  RABAM.)  Preliminary  simulations  [43],  where  noise 
fluctuations  are  many  orders  of  magnitude  greater  than 
activation  and  synaptic  fluctuations,  have  confirmed  this 
surprising  prediction.  In  some  sense  noise  cannot  beat 
RABAM  stability.  Moreover,  the  RABAM  noise  suppres¬ 
sion  theorem  ensures  that  noise  will  be  “quenched,”  to 
use  Grossberg’s  term  (13),  exponentially  quickly  in  most 
cases. 

To  prove  the  RABAM  noise  suppression  theorem,  we 
must  make  explicit  how  RABAM  instantaneous  mean- 
squared  velocities  depend  on  the  underlying  instantaneous 
noise  variances.  The  following  lemma  grounds  the  intui¬ 
tion  that  observed  second-order  behavior— the  instanta¬ 
neous  mean-squared  velocities— involves  at  least  as  much 
fluctuation  as  is  found  in  the  noise  itself. 

Lemma: 

£(x2)>a2.  £(y2)>o2,  £(m2)  >  o2.  (60) 


by  (54).  So  £(£)  <  0  or  £(£)  <  0  along  trajectories 
according  as /.ABAM  <  0  or  £ABAM  <  0.  Q.E.D. 

VI.  Noise-Saturation  Dilemma  ano  the  RABAM 
Noise  Suppression  Theorem 

How  much  do  RABAM  trajectories  and  equilibria  vi¬ 
brate?  To  answer  this  question  we  need  to  examine  the 
second  order  behavior  of  the  RABAM  model.  This  be- 


Proof.  All  three  inequalities  arc  proved  by  squaring 
both  sides  of  the  RABAM  equations  (5 1  )-(53).  taking  ex¬ 
pectations.  and  using  (54)  and  the  fact  that  the  noise  is 
uncorrclated  with  the  additive  nonlinear  “signal"  terms. 

Q.E.D. 

It  is  not  true  that  the  squared  velocity  processes  are 
never  less  than  the  squared  noise  processes  at  every  in¬ 
stant.  It  is  only  true  on  average  at  every  instant. 
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Grossberg’s  noise-saturation  dilemma  |13|  motivates 
the  use  of  the  term  "noise  suppression"  in  the  RAHAM 
corollary  below.  The  noise-saturation  dilemma  asks  how 
neurons  can  have  an  elective  infinite  dynamical  range 
when  they  operate  between  upper  and  lower  bounds  and 
yet  not  treat  small  input  signals  as  noise:  "If  the  ,r,  are 
sensitive  to  large  inputs,  then  why  do  not  small  inputs  get 
lost  in  internal  system  noise?  If  the  x,  arc  sensitive  to  small 
inputs,  then  why  do  they  not  all  saturate  at  their  maximum 
values  in  response  to  large  inputs?"  1 14|  This  vexing  and 
ubiquitous  dilemma,  it  even  confronts  the  salesperson  who 
trys  to  balance  her  presentation  between  “little"  and 
"big"  customers,  is  the  supreme  motivator  behind  Gross- 
berg’s  shunting-model  perspective  of  neural  networks. 

Grossberg resolves the saturation half of the dilemma by showing [13], as mentioned above, that
shunting models remain sensitive to relative pattern information over a wide range of inputs. He
also shows that additive models quickly saturate to upper bounds for large inputs. Indeed this
saturation invariance result is arguably Grossberg's greatest achievement. Besides giving
information-processing insights into the global dynamics of Hodgkin-Huxley type networks, it also
drives Grossberg's conception and implementation of ART behavior, and is at the heart of his recent
vision theory. On the other hand, as Carver Mead and other neural VLSI designers have observed, it
is well known that a simple logarithmic transduction of local input light intensity into electric
potential in the visual system achieves in one stroke both sensitivity to input light intensities
over many orders of magnitude and "discounts the illuminant" [14] by equating voltage differences to
logarithms of intensity ratios.
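
The "one stroke" remark about logarithmic transduction can be made concrete with a one-line identity. As a sketch only, assume a transduction of the form $V(I) = k \ln I$; the gain $k$, the common illuminant factor $c$, and the symbols $V$, $I_1$, $I_2$ are notation introduced here for illustration and do not appear in the original text.

```latex
V(cI_1) - V(cI_2) = k \ln(cI_1) - k \ln(cI_2) = k \ln\frac{I_1}{I_2}
```

Under that assumption the common factor $c$ cancels, so the voltage difference depends only on the intensity ratio $I_1 / I_2$, while the logarithm compresses many orders of magnitude of $I$ into a modest range of $V$.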

Grossberg's resolution of the noise half of the noise-saturation dilemma is far less satisfactory.
Grossberg [13] argues that noisy patterns are uniform input patterns and that, for a particular
small threshold value, uniform noise is "suppressed" by all neurons in the field shutting off.
Besides the dependence on a specific noise threshold, this argument is objectionable on at least two
counts. First, noise permeates all parameters and all signals and certainly need not be uniform.
Grossberg admits this in his above description of the noise-saturation dilemma when he asks why
small inputs do not "get lost in internal system noise." System noise makes everything "jiggle,"
including relative input pattern values. This is the noise modeled by the additive noise processes
in the RABAM equations (51)-(53) or, more realistically, by the additive diffusion processes in the
diffusion RABAM equations (47)-(49).

Second, shutting off neurons to suppress noise seems akin to curing the patient by killing him. The
goal is to continue "computing" as accurately as possible no matter how noisy the environment.
Background noise can be high in feedback systems where noise can multiply by recirculating. In
fairness, Grossberg [14] argues that special classes of signal functions, especially sigmoid signal
functions, help quench pattern noise by contrast-enhancing input signals. Signal function
nonlinearities surely help suppress this special occurrence of noise. But what about synaptic noise?
What about joint synaptic and activation noise? What about noise compounded by feedback? How do we
know such pervasive noise will not prevent an ART system from adaptively resonating, or ruin an
adaptive-resonance equilibrium once achieved?

The RABAM noise suppression theorem is an alternative resolution of the noise half of the
noise-saturation dilemma. It guarantees that second-order behavior in RABAM systems is as good as it
can be: mean-squared velocities decrease exponentially quickly to their lower bounds. As the above
lemma shows, these lower bounds are just the underlying driving noise variances. Thus the observed
fluctuations, the mean-squared velocities, track the unobserved noise fluctuations. Unaided feedback
intuitions might easily lead to the prediction that, in light of the lemma, mean-squared velocities
may tend toward infinity, especially for widely fluctuating noise processes.
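
The lower bound that the lemma supplies can be checked in one line for the activation velocity. As a sketch, write the activation equation (51) as $\dot{x}_i = f_i + n_i$, where $f_i$ is shorthand introduced here (it is not the original's notation) for the nonlinear "signal" term, and assume, as the lemma's proof does, that $n_i$ is zero-mean by (54) and uncorrelated with $f_i$:

```latex
E(\dot{x}_i^2) = E(f_i^2) + 2\,E(f_i n_i) + E(n_i^2)
             = E(f_i^2) + 2\,E(f_i)\,E(n_i) + \sigma_i^2
             = E(f_i^2) + \sigma_i^2 \;\ge\; \sigma_i^2
```

The same squaring argument applies to $\dot{y}_j$ and $\dot{m}_{ij}$. It also shows what remains to be proved: the bound is attained exactly when the mean-squared "signal" term $E(f_i^2)$ vanishes.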


A. RABAM Noise Suppression Theorem

For strictly increasing signal functions $S_i$ and $S_j$, positive amplification functions $a_i$ and
$a_j$, and nondegenerate Hessian conditions: as the RABAM system (51)-(55) converges exponentially
quickly, the mean-squared activation and synaptic velocities decrease to their lower bounds
exponentially quickly:

$$E(\dot{x}_i^2) \downarrow \sigma_i^2, \quad E(\dot{y}_j^2) \downarrow \sigma_j^2, \quad E(\dot{m}_{ij}^2) \downarrow \sigma_{ij}^2. \tag{61}$$

Proof. The proof uses the asymptotic convergence established in the above RABAM theorem for the
monotonicity and positivity assumptions and the lower bound on mean-squared velocities established
in the lemma (60). Then

by using the positivity of the amplification functions and (51)-(53) to eliminate the terms in
braces in (57) in the proof of the RABAM theorem

by using (51)-(53) again to eliminate activation and synaptic velocities in the second expectation
above, rearranging, and, as in the proof of the RABAM theorem, using the uncorrelatedness of noise
and "signal" terms in (51)-(53) as discussed above to obtain (59)

by the zero-mean noise assumption (54) and rearrangement. The lemma ensures that the double sum is
nonnegative. The RABAM theorem establishes that the Lyapunov function $E(L)$ strictly decreases
along trajectories, and thus trajectories end at equilibrium points and arrive there exponentially
quickly. This, together with the positivity (and well-behavedness [34]) of the weight ratios $S'/a$,
yields the equilibrium conditions:

$$E(\dot{x}_i^2) = \sigma_i^2, \quad E(\dot{y}_j^2) = \sigma_j^2, \quad E(\dot{m}_{ij}^2) = \sigma_{ij}^2. \tag{63}$$

This implies (62). Q.E.D.

The RABAM noise suppression theorem generalizes the equilibrium conditions obtained in the ABAM
theorem in the asymptotic-convergence case. For if the instantaneous "variances" in (63) are zero,
then [38] the squared velocities, and thus the velocities, are zero almost everywhere. The
zero-variance case is the deterministic case. The sigma-algebra of the probability space is
degenerate; it only contains the whole space and the null set. Thus the activation and synaptic
velocities are zero everywhere, as in the strict ABAM case. Also note that throughout the proofs of
the RABAM theorem and the RABAM noise suppression theorem, the synaptic terms are easier to work
with, and the results are "cleaner," because they do not possess nonlinear signal and amplification
terms. We recall again that the above two theorems are also valid for suitably randomized
competitive, differential Hebb, and differential competitive learning laws under appropriate
conditions.
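
The qualitative content of the noise suppression theorem can be illustrated numerically. The following is a minimal sketch, not the RABAM model itself: it assumes a single scalar activation with amplification $a = 1$, $b(x) = x$, and no synaptic feedback, so the velocity is $\dot{x} = -x + n$. The noise is sampled as independent zero-mean Gaussian values with variance $\sigma^2$ at each Euler step. The function name, step size, trial count, and seed are all hypothetical choices made for illustration.

```python
import random


def mean_squared_velocity(sigma, steps=300, dt=0.05, trials=2000, x0=5.0, seed=0):
    """Return the sample mean-squared velocity at each time step.

    Toy scalar analog of a noisy activation equation (an assumption,
    not the full RABAM system): velocity = -x + noise, with zero-mean
    Gaussian noise of standard deviation `sigma` drawn independently
    at every step and for every trial.
    """
    rng = random.Random(seed)
    xs = [x0] * trials              # every trial starts far from equilibrium
    history = []
    for _ in range(steps):
        total = 0.0
        for k in range(trials):
            noise = rng.gauss(0.0, sigma)
            velocity = -xs[k] + noise   # "signal" term plus additive noise
            xs[k] += dt * velocity      # forward Euler update
            total += velocity * velocity
        history.append(total / trials)  # estimate of E(velocity^2)
    return history


if __name__ == "__main__":
    sigma = 2.0
    history = mean_squared_velocity(sigma)
    print("noise variance:", sigma ** 2)
    print("initial mean-squared velocity:", history[0])
    print("final mean-squared velocity:", history[-1])
```

In this toy setting the estimated mean-squared velocity starts well above the noise variance, because the state is far from equilibrium, and then decays toward a value near $\sigma^2$. A small excess over $\sigma^2$ remains because the discretized state keeps jittering about its equilibrium.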

VII. RABAM Annealing and the Ito-Stratonovich Stochastic Calculus

Gradient systems are globally stable. The above theorems are an extension of this general Lyapunov
fact. For example, Cohen and Grossberg [6] showed that their symmetric nonlearning autoassociative
system can be written in pseudogradient form for monotone increasing signal functions and positive
amplification functions.

Geman and Hwang [8] recently showed that stochastic gradient systems with scaled additive Brownian
diffusions (noise) perform simulated annealing in a weak sense. The gradient is formed from a cost
function to be searched by scaled random hill climbing. If the noise is initially scaled high enough
(to a physically unrealizable size), then gradually decreasing the nonnegative "temperature" $T(t)$
scaling factor can bounce the system state out of local minima and trap it in global minima. The
convergence, though, must proceed exponentially slowly and is only weak (analogous to the
convergence in distribution found in central limit theorems). The result is not true for convergence
with probability one or even convergence in probability. There is still some probability that the
system state will bounce out of global or near-global minima as "cooling" finishes.

We now extend the RABAM theorem and RABAM noise suppression theorem to include simulated annealing
in the general Geman-Hwang sense. For this we introduce the activation "temperatures" or annealing
schedules $T_i(t)$ and $T_j(t)$ and the synaptic schedules $T_{ij}(t)$. The temperatures are
nonnegative deterministic functions. So they can be brought outside all expectations in proofs. The
RABAM annealing model is more general than the Geman-Hwang gradient model, and vastly more general
than popular additive-activation annealing models, because learning is permitted and because
learning too can be annealed, although perhaps at a different rate than activation annealing. The
RABAM annealing model is defined by scaling the diffusion differentials in (47)-(49) with the square
root of the corresponding annealing schedules or, in the noise RABAM, by replacing (51)-(53) with
(64)-(66):

$$\dot{x}_i = -a_i \Big[ b_i - \sum_j S_j m_{ij} \Big] + \sqrt{T_i} \, n_i \tag{64}$$

$$\dot{y}_j = -a_j \Big[ b_j - \sum_i S_i m_{ij} \Big] + \sqrt{T_j} \, n_j \tag{65}$$

$$\dot{m}_{ij} = -m_{ij} + S_i S_j + \sqrt{T_{ij}} \, n_{ij} \tag{66}$$

where again (66) can be replaced with the other unsupervised learning laws discussed above with
appropriate additional constraints.

A. RABAM Annealing Theorem

The RABAM annealing model is globally stable, and asymptotically stable for monotone increasing
signal functions and positive amplification functions, in which case the mean-squared activation and
synaptic velocities decrease to their temperature-scaled instantaneous "variances" exponentially
fast:

$$E(\dot{x}_i^2) \downarrow T_i \sigma_i^2, \quad E(\dot{y}_j^2) \downarrow T_j \sigma_j^2, \quad E(\dot{m}_{ij}^2) \downarrow T_{ij} \sigma_{ij}^2. \tag{67}$$

Proof. The proof largely duplicates the proofs of the RABAM theorem and RABAM noise suppression
theorem. Again $E(L)$ is a sufficiently smooth Lyapunov function that allows time differentiation of
the integrand. When the diffusion or noise RABAM annealing equations are used to eliminate
activation and synaptic velocities in the time-differentiated Lyapunov function, the resulting
temperature functions that occur can be factored outside all expectations. The nonnegativity of the
temperature functions keeps them from affecting the structure of the expanded time derivative of
$E(L)$. The random weight functions $S'/a$ are assumed sufficiently well behaved to keep the
expectations in which they occur nonnegative. The above lemma generalizes so that the mean-squared
velocity $E(\dot{x}_i^2)$ is bounded below by $T_i \sigma_i^2$. Then (62) is generalized to

$$
\dot{E}(L) = -\sum_i E\Big[ \frac{S_i'}{a_i} \big( \dot{x}_i^2 - T_i \sigma_i^2 \big) \Big]
 - \sum_j E\Big[ \frac{S_j'}{a_j} \big( \dot{y}_j^2 - T_j \sigma_j^2 \big) \Big]
 - \sum_i \sum_j \big[ E(\dot{m}_{ij}^2) - T_{ij} \sigma_{ij}^2 \big].
\tag{68}
$$

Q.E.D.

The RABAM annealing theorem is a nonlinear and continuous generalization of Boltzmann machine
learning [40], provided learning is Hebbian and very slow. The Boltzmann machine uses discrete
symmetric additive autoassociative dynamics. Binary neurons are annealed during periods of Hebbian
and anti-Hebbian learning. Here Hebbian learning corresponds to (66) with $T_{ij}(t) = 0$ for all
$t$. Anti-Hebbian learning further replaces the Hebb product $S_i S_j$ in (66) with the negative
product $-S_i S_j$. Anti-Hebbian learning (during "free-running" training [40]) can in principle
destabilize a RABAM system. This is less likely to occur, though, the slower the anti-Hebbian
learning. (The activation terms in the time derivative of $E(L)$ stay negative and can outweigh the
possibly positive anti-Hebbian terms, even if learning is fast.) Incidental instability perhaps is
not even a problem in this phase of annealing, since the intention is to undo some of the learning
in the "environmental" annealing phase. The fundamental distinction between unsupervised RABAM
learning and temperature-supervised annealing learning is how noise is treated. Simulated annealing
systems search or learn with noise. Unsupervised RABAM systems learn despite noise. During
"cooling," the continuous annealing schedules define the flow of RABAM equilibria in the product
state space of continuous nonlinear random processes. Equation (67) implies that no finite
temperature value, however large, can destabilize a RABAM.

Finally, the proofs of the above RABAM theorems repeatedly use the familiar chain rule of
differential calculus. In general, the chain rule does not apply to systems of nonlinear stochastic
differential equations, at least not in the general case where each nonlinear parameter is itself a
stochastic process. This is the general setting for the Ito calculus. One exception is the related
Stratonovich calculus, which defines a stochastic integral (an integral defined with respect to a
random measure [41]) with a slightly different partitioning of the time interval. The Stratonovich
calculus includes the classical chain rule, though in general at the expense of possessing
non-Markovian solution processes.

Maybeck [35] shows that, with probability one, the Ito stochastic differential equals the
Stratonovich stochastic differential plus a term involving the nonlinear random scaling factor on
the underlying Brownian diffusion. The two differentials and corresponding integrals are equal when
this extra term is zero. This is fortunately always true for RABAM systems since noise terms are
scaled with constants or sequences of constants (deterministic annealing schedules). The extra term
involves the derivative of this constant with respect to the corresponding random activation or
synapse. Thus RABAM models enjoy the best of both stochastic-calculus worlds. They maintain the
familiar chain rule of Stratonovich stochastic dynamical systems and inherit the better-explored
properties of Ito stochastic dynamical systems. For instance, all RABAM solution processes are
Markov processes. This promises a new approach to nonlinear stochastic optimal estimation and
control.


VIII.  Conclusions 

The RABAM model unifies many popular feedforward and feedback unsupervised learning systems and
extends them to the more realistic, and more complex, random process domain. Unsupervised learning
is structurally stable for wide families of nonlinear feedback dynamical systems. This holds for the
popular signal Hebb and competitive learning feedback systems under quite general conditions. It
holds to a lesser extent for the largely unexplored signal-velocity learning feedback systems that
adapt with differential Hebb or differential competitive laws. Pulse-coded [10], [11] signal
functions augment the class of feedback systems that can stably learn with the differential Hebb and
differential competitive laws, since in this case they give back, respectively, signal Hebb and
competitive learning behavior much of the time. The pulse-coding framework also promises new
engineering approaches to implementing adaptive networks, perhaps with sinusoidal techniques, as
well as suggesting new roles for signal-velocity synaptic mechanisms in real neural systems. The
feedback in these stable dynamical systems can always be eliminated to produce unsupervised
feedforward systems that stably learn with Hebbian, competitive, or signal-velocity learning laws.

The stability of RABAM models yields the structural stability of ABAM models. From an engineering
perspective, this means we can more confidently build large-scale ABAM networks with electrical,
optical, electrooptic, and perhaps other (molecular, fluid, plasma, polymer, etc.) devices.

For the neurobiologist, the structural stability of ABAM models suggests that at least some of the
consistent criticism that neural models are "unrealistic" is unfounded. The many intricate neuronal
and molecular properties that the neurobiologist studies, and finds missing in neural network
models, are modeled in RABAM systems as random unmodeled effects. The RABAM noise suppression theorem
says these unmodeled effects are ignored by the network's global computations almost as quickly as
they are encountered. Like many quantum-level effects in electrical devices, these unmodeled effects
simply do not affect the structure of global network computations, so long as they are net random
effects.

How plausible is this? Some unmodeled effects of course depend on neuronal and synaptic behavior and
so are not accurately modeled as independent noise processes, though perhaps central limit
(Gaussian) effects emerge from the interaction of many such processes. Many correlated effects can
also be incorporated as slowly varying parameters in the "signal" part of the RABAM model.

In general, the sheer number (sample size) of unmodeled effects suggests a Brownian approximation.
To the extent that the unmodeled synaptic and neuronal effects involve many independently
interacting continuous phenomena, the net result is a Brownian diffusion, as assumed by RABAM
models. This is because finite-variance continuous processes with independent increments in time
have Gaussian increments [35], and hence give rise to a Brownian diffusion.

So far as the laws of mathematics refer to reality, they are not certain. And so far as
they are certain, they do not refer to reality.

Albert Einstein


Fuzziness is explored as an alternative to randomness for describing uncertainty. The new sets-as-points
geometric view of fuzzy sets is developed. This view identifies a fuzzy set with a point in a unit
hypercube and a nonfuzzy set with a vertex of the cube. Paradoxes of two-valued logic and set theory,
such as Russell's paradox, correspond to the midpoint of the fuzzy cube. The fundamental questions of
fuzzy theory (How fuzzy is a fuzzy set? How much is one fuzzy set a subset of another?) are answered
geometrically with the Fuzzy Entropy Theorem, the Fuzzy Subsethood Theorem, and the Entropy-Subsethood
Theorem. A new geometric proof of the Subsethood Theorem is given, a corollary of which
is that the apparently probabilistic relative frequency $n_A/N$ turns out to be the deterministic
subsethood $S(X, A)$, the degree to which the sample space $X$ is contained in its subset $A$. So the
frequency of successful trials is viewed as the degree to which all trials are successful. Recent
Bayesian polemics against fuzzy theory are examined in light of the new sets-as-points theorems.

INDEX TERMS: Probability theory, fuzzy set theory, fuzzy subsethood, geometry of fuzzy sets.


1.  FUZZINESS  IN  A  PROBABILISTIC  WORLD 


Is  uncertainty  the  same  as  randomness?  If  we  are  not  sure  about  something,  is  it 
only  up  to  chance?  Do  the  notions  of  likelihood  and  probability  exhaust  our 
notions  of  uncertainty? 

Many  people,  trained  in  probability  and  statistics,  believe  so.  Some  even  say  so, 
and  say  so  loudly.  These  voices  are  often  heard  in  the  Bayesian  camp  of  statistics, 
where  probability  is  viewed,  not  as  a  frequency  or  other  objective  testable 
quantity,  but  as  a  subjective  state  of  knowledge. 

Bayesian physicist E. T. Jaynes says [6] that

any method of inference in which we represent degrees of plausibility by real numbers, is necessarily
either equivalent to Laplace's [probability], or inconsistent.

He claims physicist R. T. Cox [3] has proven this as a theorem, a claim we examine
below.

More recently, Bayesian statistician Dennis Lindley [13] issued an explicit
challenge:

probability is the only sensible description of uncertainty and is adequate for all problems involving
uncertainty. All other methods are inadequate.

Lindley directs his challenge in large part at fuzzy theory, the theory that all
things admit degrees, but admit them deterministically. This article accepts the
probabilist's challenge from the fuzzy viewpoint (admitting but ignoring other
approaches to uncertainty, such as Dempster-Shafer belief function theory) by
defending fuzziness from new geometric first principles and by questioning the
reasonableness and the axiomatic status of randomness. The new view is the
sets-as-points view [11] of fuzzy sets: A fuzzy set is a point in a unit hypercube and a
nonfuzzy set is a corner of the hypercube.

There  are  conceptual  and  theoretical  differences  between  randomness  and 
fuzziness.  Some  can  be  illustrated  with  examples.  Some  can  be  proven  with 
theorems,  as  we  show  below. 

There are also many similarities. The chief, but superficial, similarity is that both
systems describe uncertainty with numbers in the unit interval [0, 1]. This
ultimately means that both systems describe uncertainty numerically. The structural
similarity is that both systems combine sets and propositions associatively,
commutatively, and distributively. The key distinction concerns how the systems
deal simultaneously with a thing $A$ and its opposite $A^c$.
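
The distinction can be seen by computing with the sets-as-points picture directly. The following is a minimal sketch that assumes the standard pointwise fuzzy operators (minimum for intersection, maximum for union, one-minus for complement), which this passage does not itself spell out. A set is represented as a list of membership values in [0, 1], and the function names are hypothetical.

```python
def complement(a):
    """Pointwise complement: membership 1 - x (assumed standard operator)."""
    return [1.0 - x for x in a]


def intersection(a, b):
    """Pointwise minimum (assumed standard fuzzy intersection)."""
    return [min(x, y) for x, y in zip(a, b)]


def union(a, b):
    """Pointwise maximum (assumed standard fuzzy union)."""
    return [max(x, y) for x, y in zip(a, b)]


def is_vertex(a):
    """True when every membership value is 0 or 1, i.e. a cube corner."""
    return all(x in (0.0, 1.0) for x in a)


crisp = [1.0, 0.0, 1.0]      # a nonfuzzy set: a vertex of the unit cube
fuzzy = [0.25, 0.75, 0.5]    # a fuzzy set: an interior point of the cube

overlap_crisp = intersection(crisp, complement(crisp))
overlap_fuzzy = intersection(fuzzy, complement(fuzzy))
underlap_fuzzy = union(fuzzy, complement(fuzzy))

if __name__ == "__main__":
    print(is_vertex(crisp), overlap_crisp)   # the overlap is the empty set
    print(is_vertex(fuzzy), overlap_fuzzy)   # the overlap is not empty
    print(underlap_fuzzy)                    # the union falls short of X
```

At a vertex the set and its complement do not overlap, as classical set theory requires. At an interior point they do, and their union no longer fills the whole space.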

Questions raise doubt, and doubt suggests room for change. So to commence
the exposition, consider the following two questions, one fuzzy and the other
probabilistic:

i) Is it always and everywhere true that $A \cap A^c = \emptyset$?

ii) Who can derive the conditional probability operator

$$P(B \mid A) = \frac{P(A \cap B)}{P(A)}\,?$$


The second question may appear less fundamental than the first question, which
asks whether fuzziness exists. The Entropy-Subsethood Theorem below shows that
the first and second questions are connected: How fuzzy a fuzzy set $A$ is can be
measured by how much the superset $A \cup A^c$ is a subset of its own subset $A \cap A^c$, a
paradoxical relationship unique to fuzzy theory. In contrast, in probability theory
this state of affairs is impossible (has zero probability):
$P(A \cap A^c \mid A \cup A^c) = P(\emptyset \mid X) = 0$, where $X$ is the sample space or
"sure event" and the empty set $\emptyset$ is the "impossible event".
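
The relationship between fuzziness and subsethood described above can be sketched numerically. The code below assumes two things that this passage does not state: the pointwise minimum, maximum, and one-minus operators, and a count-ratio form of subsethood, $S(A, B) = M(A \cap B) / M(A)$, where $M$ sums the membership values. All function names are hypothetical, and the sketch illustrates the idea without reproducing the theorems themselves.

```python
def complement(a):
    """Pointwise complement (assumed standard operator)."""
    return [1.0 - x for x in a]


def intersection(a, b):
    """Pointwise minimum (assumed standard fuzzy intersection)."""
    return [min(x, y) for x, y in zip(a, b)]


def union(a, b):
    """Pointwise maximum (assumed standard fuzzy union)."""
    return [max(x, y) for x, y in zip(a, b)]


def count(a):
    """Sum of membership values (assumed measure of set size)."""
    return sum(a)


def subsethood(a, b):
    """Degree to which a is a subset of b, as an assumed count ratio."""
    if count(a) == 0:
        return 1.0  # convention: the empty set is a subset of every set
    return count(intersection(a, b)) / count(a)


def fuzzy_entropy(a):
    """How much the superset A u A^c is contained in its subset A n A^c."""
    ac = complement(a)
    return subsethood(union(a, ac), intersection(a, ac))


if __name__ == "__main__":
    print(fuzzy_entropy([1.0, 0.0, 1.0]))     # vertex: not fuzzy at all
    print(fuzzy_entropy([0.5, 0.5, 0.5]))     # midpoint: maximally fuzzy
    print(fuzzy_entropy([0.25, 0.75, 0.5]))   # interior point: in between
    # Relative frequency as subsethood: 3 successes in 5 trials.
    print(subsethood([1.0] * 5, [1.0, 0.0, 1.0, 1.0, 0.0]))
```

Under these assumptions a cube vertex has zero fuzziness and the cube midpoint has maximal fuzziness. The last line shows the sample space, taken as the all-ones vector, being contained in a nonfuzzy subset of successes to degree 3/5, which matches the relative frequency.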

The conditioning or subsethood in the second question is at the heart of
Bayesian probabilistic systems. The absence of a first-principles derivation of
$P(B \mid A)$ in itself may be acceptable. One simply agrees to take the ratio
relationship as an axiom. The problem is that the new sets-as-points view of fuzzy
sets derives its conditioning operator as a theorem from first principles. The
history of science suggests that systems that hold theorems as axioms continue to
evolve.

The first question asks whether the law of noncontradiction (one of Aristotle's
three "laws of thought" along with the laws of excluded middle, $A \cup A^c = X$, and
identity, $A = A$) can be violated. Set fuzziness occurs when, and only when, it is
violated. Classical logic and set theory assume that the law of noncontradiction,
and equivalently the law of excluded middle, is never violated. That is what makes
the classical theory black or white. Fuzziness begins where Western logic ends.


2.  RANDOMNESS  VS.  AMBIGUITY:  WHETHER  VS.  HOW  MUCH 

Fuzziness  describes  event  ambiguity.  It  measures  the  degree  to  which  an  event 
occurs,  not  whether  it  occurs.  Randomness  describes  the  uncertainty  of  event 
occurrence.  An  event  occurs  or  not,  and  you  can  bet  on  it.  At  issue  is  the  nature  of 
the  occurring  event:  whether  it  itself  is  uncertain  in  any  way,  in  particular  whether 
it  can  be  unambiguously  distinguished  from  its  opposite. 

Whether an event occurs is "random". To what degree it occurs is fuzzy.
Whether an ambiguous event occurs (as when we say there is a 20% chance of light
rain tomorrow) involves compound uncertainties, the probability of a fuzzy event.

In practice we regularly apply probabilities to fuzzy events: small errors, satisfied
customers, A students, safe investments, developing countries, noisy signals, spiking
neurons, dying cells, charged particles, nimbus clouds, planetary atmospheres,
galactic clusters. We understand that, at least around the edges, some satisfied
customers can be somewhat unsatisfied, some A students might equally be B+
students, some stars are as much in a galactic cluster as out of it. Events can more
or less smoothly transition to their opposites, making classification hard near the
midpoint of the transition. But in theory (in formal descriptions and in textbooks)
the events and their opposites are black and white. A hill is a mountain if
it is at least x meters tall, not a mountain if it is one micron less than x in height.
Every molecule in the universe either is or is not a pencil molecule, even those
hovering above the pencil's surface.

Consider  some  further  examples.  The  probability  that  this  essay  gets  published 
is  one  thing.  The  degree  to  which  it  gets  published  is  another.  The  essay  may  be 
edited  in  hundreds  of  ways.  Or  the  essay  may  be  marred  with  typographical 
errors,  and  so  on. 

Question: Does quantum mechanics deal with the probability that an unambiguous
electron occupies spacetime points? Or does it deal with the degree to which
an electron, or an electron smear, occurs at spacetime points? Does $|\psi|^2 \, dV$
measure the probability that a random-point electron occurs in infinitesimal
volume $dV$? Or [12] does it measure the degree to which a deterministic electron
cloud occurs in $dV$? Different interpretation, different universe. Perhaps even
existence admits degrees.

Suppose  there  is  50%  chance  that  there  is  an  apple  in  the  refrigerator  (electron 
in  a  cell12).  That  is  one  state  of  affairs,  perhaps  arrived  at  through  frequency 
calculations  or  a  Bayesian  state  of  knowledge.  Now  suppose  there  is  half  an  apple 
in  the  refrigerator.  That  is  another  state  of  affairs.  Both  states  of  affairs  are 
superficially  equivalent  in  terms  of  their  numerical  uncertainty.  Yet  physically, 
ontologically,  they  are  distinct.  One  is  “random”,  the  other  fuzzy. 

If events are assumed unambiguous, as in balls-in-urns experiments, there is no
fuzziness.  Only  randomness  remains.  But  when  discussing  the  physical  universe, 
every  assertion  of  event  ambiguity  or  nonambiguity  is  an  empirical  hypothesis. 
This  is  habitually  overlooked  when  applying  probability  theory.  Years  of  such 
oversight  are  perhaps  responsible  for  the  deeply  entrenched  sentiment  that 
uncertainty  is  randomness,  and  randomness  alone.  The  silent  assumption  of 
universal nonambiguity is akin to the pre-relativistic assumption of an uncurved
universe. $A \cap A^c = \emptyset$ is the "parallel postulate" of classical set theory and logic,
indeed of Western thought.

Figure 1 Inexact oval. Which statement better describes the situation: (a) "It is probably an ellipse" or
(b) "It is a fuzzy ellipse"?

If  fuzziness  is  a  unique  type  of  uncertainty,  if  fuzziness  exists,  the  physical 
consequences  are  universal,  and  the  sociological  consequence  is  startling:  scientists, 
especially  physicists,  have  overlooked  an  entire  mode  of  reality. 

Fuzziness  is  a  type  of  deterministic  uncertainty.  Ambiguity  is  a  property  of 
physical  phenomena.  Unlike  fuzziness,  probability  dissipates  with  increasing 
information.  After  the  fact  “randomness”  looks  like  fiction.  (This  is  especially 
awkward since in general the laws of science are time reversible, invariant if time $t$
is replaced with time $-t$. Where does the randomness go?) Yet there is as much
ambiguity  after  a  sample-space  experiment  as  before.  Increasing  information  tends 
to  specify  the  degrees  of  occurrence.  Even  if  science  had  run  its  course  and  all  the 
facts  were  in,  a  platypus  would  remain  only  roughly  a  mammal,  a  large  hill  only 
roughly  a  mountain,  an  oval  squiggle  only  roughly  an  ellipse.  Fuzziness  does  not 
require  that  God  play  dice. 

Consider  the  inexact  oval  in  Figure  1.  Does  it  make  more  sense  to  say  that  the 
oval  is  probably  a  circle  (or  ellipse),  or  that  it  is  a  fuzzy  ellipse?  There  is  nothing 
random  about  the  matter.  The  situation  is  deterministic:  All  the  facts  are  in.  Yet 
uncertainty  remains.  The  uncertainty  is  due  to  the  simultaneous  occurrence  of  two 
properties:  to  some  extent  the  inexact  oval  is  an  ellipse  and  to  some  extent  it  is 
not  an  ellipse. 

More formally, is $m_A(x)$, the degree to which element x belongs to fuzzy set A,
simply the probability that x is in A? Is $m_A(x) = \mathrm{Prob}\{x \in A\}$ true?
Cardinality-wise, sample spaces cannot be too big. Else a positive measure cannot be both
countably  additive  and  finite,  and  thus  a  probability  measure.  The  space  of  all 
possible  oval  figures  is  too  big,  since  there  are  more  of  these  than  real  numbers. 
Almost  all  sets  are  too  big  to  define  probabilities,  yet  fuzzy  sets  can  always  be 
defined. 

$\mathrm{Prob}\{x \in A\}$ might be interpreted as the probability of a fuzzy event, the
probability that element x belongs to fuzzy set A with degree $m_A(x)$. Rarely indeed
then should the equality $\mathrm{Prob}\{x \in A\} = m_A(x)$ occur.

But this is not the intended interpretation of the assertion $\mathrm{Prob}\{x \in A\} = m_A(x)$.
Instead set A is not fuzzy. The element x either is or is not an element of set A.
We do not know which, and we describe this uncertainty with the probability
$\mathrm{Prob}\{x \in A\}$. But then surely $\mathrm{Prob}\{x \in A\} \neq m_A(x)$. For example,
$\mathrm{Prob}\{x \in A \cap A^c\} = 0$ and $\mathrm{Prob}\{x \in A \cup A^c\} = 1$ for every nonfuzzy set A. Yet
$m_{A \cap A^c}(x) > 0$ and $m_{A \cup A^c}(x) < 1$ for every properly fuzzy set A.

Probability  theory  is  a  chapter  in  the  book  of  finite  measure  theory.  Many 
probabilists  do  not  care  for  this  classification,  but  they  fall  back  upon  it  when 
defining  terms.7  How  reasonable  is  it  to  believe  that  finite  measure  theory — 
ultimately,  the  summing  of  nonnegative  numbers  to  unity — exhaustively  describes 
the  (quantum-mechanical)  universe?  Does  it  really  describe  any  thing? 

Surely  from  time  to  time  every  probabilist  wonders  whether  probability 
describes  anything  real.  From  Democritus  to  Einstein,  there  has  been  the  suspicion 
that,  as  David  Hume5  put  it, 

though  there  be  no  such  thing  as  chance  in  the  world,  our  ignorance  of  the  real  cause  of  any  event  has 
the  same  influence  on  the  understanding  and  begets  a  like  species  of  belief. 

When  we  model  noisy  processes  by  extending  differential  equations  to  stochastic 
differential  equations,  it  seems  we  introduce  the  formalism  only  as  a  working 
approximation to several underlying unspecified processes, processes that presumably
obey deterministic differential equations. In this sense conditional expectations
and martingale techniques might seem reasonably applied, for example, to
stock  options  or  commodity  futures  phenomena,  where  the  behavior  involved 
consists  of  aggregates  of  aggregates  of  aggregates.  The  same  techniques  seem  less 
reasonably  applied  to  quarks,  leptons,  and  void. 


3.  THE  UNIVERSE  AS  A  FUZZY  SET 

The  world,  as  Wittgenstein11  observed,  is  everything  that  is  the  case.  In  this  spirit 
we  can  summarize  the  ontological  case  for  fuzziness:  The  universe  consists  of  all 
subsets  of  the  universe.  The  only  subsets  of  the  universe  that  are  not  fuzzy  are  the 
constructs of classical mathematics. All other sets (sets of particles, cells, tissues,
people, ideas, galaxies) in principle contain elements to different degrees. Their
membership  is  partial,  graded,  inexact,  ambiguous,  or  uncertain. 

The same universal circumstance holds at the level of logic and truth. The only
logically true or false statements (statements S with truth value $t(S)$ in $\{0, 1\}$) are
tautologies, theorems, and contradictions. If S is any statement about the universe,
an empirical statement, then $0 < t(S) < 1$ holds by the canons of scientific method
and by the lack of a single demonstrated factual statement S with $t(S) = 1$ or
$t(S) = 0$. That is the thrust of Einstein's quote above.

Fuzziness arises from the ambiguity between a thing A and its opposite $A^c$. If we
do not know A with certainty, we do not know $A^c$ with certainty either. Else by
double negation we would know A with certainty. This produces nondegenerate
overlap: $A \cap A^c \neq \emptyset$, which breaks the "law of noncontradiction". Equivalently, this
also produces nondegenerate underlap:10 $A \cup A^c \neq X$, which breaks the "law of
excluded middle". Here X is the ground set or universe of discourse. Recall4 that
these laws are never broken in probabilistic or stochastic logics, where
$P(A \text{ and not-}A) = 0$ and $P(A \text{ or not-}A) = 1$, even though they are broken with many,
perhaps most, human utterances. Nor are probability measures allowed to take such fuzzy
sets as arguments. The sets must first be quantized, rounded off, or defuzzified to
the nearest nonfuzzy set. So the question arises: How mathematically natural are
fuzzy sets?


4.  THE  GEOMETRY  OF  FUZZY  SETS:  SETS  AS  POINTS 

It  helps  to  see  the  geometry  of  fuzzy  sets  when  discussing  fuzziness.  To  date  this 
visual property has been overlooked. The emphasis has instead been on interpreting
fuzzy sets as membership functions, mappings $m_A$ from domain X to range
[0, 1]. But functions are hard to visualize. Membership functions are often pictured
as two-dimensional graphs, with the domain X misleadingly represented as
one-dimensional. The geometry of fuzzy sets involves both the domain $X = \{x_1, \ldots, x_n\}$
and the range [0, 1] of mappings $m_A: X \to [0, 1]$. The geometry of fuzzy sets is a
great  aid  in  understanding  fuzziness,  defining  fuzzy  concepts,  and  proving  fuzzy 
theorems.  Visualizing  this  geometry  may  by  itself  be  the  most  powerful  argument 
for  fuzziness. 

The geometry of fuzzy sets is revealed by asking an odd question: What does
the fuzzy power set $F(2^X)$, the set of all fuzzy subsets of X, look like? Answer: A
cube. What does a fuzzy set look like? A point in a cube. The set of all fuzzy
subsets is the unit hypercube $I^n = [0, 1]^n$. A fuzzy set is any point11 in the cube $I^n$.
So $(X, I^n)$ is the fundamental measurable space of (finite) fuzzy theory. The theory
of fuzzy sets (more accurately, the theory of continuous sets) can be taught on a
Rubik's cube.

Vertices of the cube $I^n$ are nonfuzzy sets. So the ordinary power set $2^X$, the set
of all $2^n$ nonfuzzy subsets of X, is the Boolean n-cube $B^n$: $2^X = B^n$. Fuzzy sets fill in
the lattice $B^n$ to produce the solid cube $I^n$: $F(2^X) = I^n$.

Consider the set of two elements $X = \{x_1, x_2\}$. The nonfuzzy power set $2^X$
contains four sets: $2^X = \{\emptyset, X, \{x_1\}, \{x_2\}\}$. These four sets correspond respectively
to the four bit vectors (0 0), (1 1), (1 0), and (0 1). The 1s and 0s indicate the
presence or absence of the ith element $x_i$ in the subset. More abstractly, each
subset A is uniquely defined by one of the two-valued membership functions
$m_A: X \to \{0, 1\}$.

Now consider the fuzzy subsets of X. The fuzzy subset A = (1/3 3/4) can be viewed as
one of the continuum-many continuous-valued membership functions $m_A: X \to [0, 1]$.
Indeed this is the classical Zadeh16 sets-as-functions definition of fuzzy sets.
In this example element $x_1$ belongs to, or fits in, subset A a little bit: to degree 1/3.
Element $x_2$ has more membership than not at 3/4. Analogous to the bit vector
representation of finite (countable) sets, we say that A is represented by the fit
vector (1/3 3/4). The element $m_A(x_i)$ is the ith fit10 or fuzzy unit value. The
sets-as-points view then geometrically represents the fuzzy subset A as a point in $I^2$,
the unit square, as in Figure 2.

The midpoint of the cube $I^n$ is maximally fuzzy. All its membership values are 1/2.
The midpoint is unique in two respects. First, the midpoint is the only set A that
not only equals its own opposite $A^c$ but equals its own overlap and underlap as
well:

$$A = A \cap A^c = A \cup A^c = A^c.$$

Second, the midpoint is the only point in the cube $I^n$ that is equidistant to each
of the $2^n$ vertices of the cube. The nearest corners are also the farthest. This
metrical relationship is evident in Figure 2.

Figure 2 Sets as points. The fuzzy subset A is a point in the unit 2-cube with coordinates or fit values
(1/3 3/4). The first element $x_1$ fits in or belongs to A to degree 1/3, the element $x_2$ to degree 3/4. The cube
consists of all possible fuzzy subsets of two elements $\{x_1, x_2\}$. The four corners represent the power set
$2^X$ of $\{x_1, x_2\}$.

Fuzzy  sets  are  combined16  pairwise  with  minimum,  maximum,  and  order 
reversal,  as  are  nonfuzzy  sets.  Fuzzy  set  intersection  is  defined  fitwise  by  pairwise 
minimum  (picking  the  smaller  of  the  two  elements),  union  by  pairwise  maximum, 
and  complementation  by  order  reversal.  For  example: 

$A$ = (1 0.8 0.4 0.5)

$B$ = (0.9 0.4 0 0.7)

$A \cap B$ = (0.9 0.4 0 0.5)

$A \cup B$ = (1 0.8 0.4 0.7)

$A^c$ = (0 0.2 0.6 0.5)

$A \cap A^c$ = (0 0.2 0.4 0.5)

$A \cup A^c$ = (1 0.8 0.6 0.5).

Note that the overlap fit vector $A \cap A^c$ is not the vector of all zeroes and the
underlap fit vector $A \cup A^c$ is not the vector of all ones. This is true of all properly
fuzzy sets, all points in $I^n$ other than vertex points. Indeed the min-max definitions
give at once the following fundamental characterization of fuzziness as nondegenerate
overlap and nonexhaustive underlap.

Proposition A is properly fuzzy if and only if $A \cap A^c \neq \emptyset$ and if and only if
$A \cup A^c \neq X$.

Figure 3 Completing the fuzzy square. The fuzzier A is, the closer A is to the midpoint of the fuzzy
cube. As A approaches the midpoint, all four points ($A$, $A^c$, $A \cap A^c$, and $A \cup A^c$) contract to the
midpoint. The less fuzzy A is, the closer A is to the nearest vertex. As A approaches the vertex, all four
points spread out to the four vertices and the bivalent power set $2^X$ is recovered.
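The fitwise operations and the proposition above can be checked numerically. What follows is a minimal Python sketch, not part of the original development. The helper names (`fit_vector`, `intersection`, `union`, `complement`, `is_properly_fuzzy`) are illustrative assumptions, and exact fractions are used only to keep the fit values free of floating-point roundoff.

```python
from fractions import Fraction

def fit_vector(*values):
    """Build a fit vector with exact rational entries (hypothetical helper)."""
    return tuple(Fraction(str(v)) for v in values)

def intersection(a, b):
    """Fuzzy intersection: pairwise minimum of fit values."""
    return tuple(min(x, y) for x, y in zip(a, b))

def union(a, b):
    """Fuzzy union: pairwise maximum of fit values."""
    return tuple(max(x, y) for x, y in zip(a, b))

def complement(a):
    """Fuzzy complement: order reversal, 1 minus each fit value."""
    return tuple(1 - x for x in a)

def is_properly_fuzzy(a):
    """True when at least one fit value lies strictly between 0 and 1."""
    return any(0 < x < 1 for x in a)

# The example vectors from the text.
A = fit_vector(1, 0.8, 0.4, 0.5)
B = fit_vector(0.9, 0.4, 0, 0.7)

overlap = intersection(A, complement(A))   # A and not-A
underlap = union(A, complement(A))         # A or not-A

# A vertex of the cube (a nonfuzzy set), chosen here for contrast.
vertex = fit_vector(1, 0, 1, 1)
vertex_overlap = intersection(vertex, complement(vertex))
vertex_underlap = union(vertex, complement(vertex))
```

For the properly fuzzy A the overlap has nonzero entries and the underlap has entries below one, while for the vertex the overlap is the zero vector and the underlap is the vector of ones.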

An  illustration  of  this  fundamental  proposition  is  what  we  might  call  completing 
the fuzzy square. Consider again the two-dimensional fuzzy set A defined by the fit
vector (1/3 3/4). The corresponding overlap and underlap sets can be found by first
finding the complement set $A^c$ and then combining the fit vectors pairwise with
minimum and with maximum:

$A$ = (1/3 3/4)

$A^c$ = (2/3 1/4)

$A \cap A^c$ = (1/3 1/4)

$A \cup A^c$ = (2/3 3/4)

The  sets-as-points  view  shows  that  these  four  points  in  the  unit  square  hang 
together,  indeed  move  together,  in  a  very  natural  way.  Consider  the  geometry  of 
Figure  3. 

In Figure 3 the four fuzzy sets involved in the fuzziness of set A (the sets $A$, $A^c$,
$A \cap A^c$, and $A \cup A^c$) contract to the midpoint as A becomes maximally fuzzy and
expand out to the Boolean corners of the cube as A becomes minimally fuzzy. The
same contraction and expansion occurs in n dimensions for the $2^n$ fuzzy sets
defined by all combinations of $m_A(x_1)$ and $m_{A^c}(x_1)$, ..., $m_A(x_n)$ and $m_{A^c}(x_n)$.

At  the  midpoint  nothing  is  distinguishable.  At  the  vertices  everything  is 
distinguishable.  These  extremes  represent  the  two  ends  of  the  spectrum  of  logic 
and  set  theory.  In  this  sense  the  midpoint  is  the  black  hole  of  set  theory. 


5.  PARADOX  AT  THE  MIDPOINT 

The  midpoint  is  full  of  paradox.  It  is  forbidden  to  classical  logic  and  set  theory. 
Where  midpoint  phenomena  appear  in  Western  thought  they  are  invariably 
labeled  “paradoxes”  or  denied  altogether.  Midpoint  phenomena  include  the  half- 
empty  and  half-full  cup,  the  Taoist  Yin-Yang,  the  liar  from  Crete  who  said  that  all 
Cretans  are  liars,  Bertrand  Russell’s  set  of  all  sets  that  are  not  members  of 
themselves,  and  Russell’s  barber. 

Russell’s  barber  is  a  bewhiskered  man  who  lives  in  a  town  and  shaves  a  man  if 
and  only  if  he  does  not  shave  himself.  So  who  shaves  the  barber?  If  he  shaves 
himself,  then  by  definition  he  does  not.  But  if  he  does  not  shave  himself,  then  by 
definition  he  does.  So  he  does  and  he  does  not — contradiction  (“paradox”). 
Gaines4 observed that this paradoxical circumstance can be numerically interpreted
as follows.

Let S be the proposition that the barber shaves himself and not-S that he does
not. Then since S implies not-S and not-S implies S, the two propositions are
logically equivalent: S = not-S. Equivalent propositions have the same truth values:

$$t(S) = t(\text{not-}S) = 1 - t(S).$$

Solving for $t(S)$ gives the midpoint of the truth interval (the one-dimensional
cube [0, 1]): $t(S) = 1/2$. The midpoint is equidistant to the vertices 0 and 1. In the
bivalent (two-valued) case, roundoff is impossible and paradox occurs.

In  bivalent  logic  both  statements  S  and  not-S  must  have  truth  value  zero  or 
unity.  The  fuzzy  resolution  of  the  paradox  only  uses  the  fact  that  the  truth  values 
are equal. It does not in principle constrain their range. The midpoint value 1/2
emerges from the structure of the problem and the order-reversing effect of
negation. 

The  paradoxes  of  classical  set  theory  and  logic  are  part  of  the  price  one  pays  for 
an  arbitrary  insistence  on  bivalence.  This  insistence  is  often  made  in  the  name  of 
science. In the end, though, it is simply a cultural preference, a reflection of an
educational  predilection  that  goes  back  at  least  to  Aristotle.  It  takes  great  faith  to 
insist  on  bivalence  in  the  face  of  both  bivalent  contradictions  (paradoxes)  and  a 
consistent  fuzzy  alternative. 

Put  another  way,  fuzziness  shows  that  there  are  limits  to  logical  certainty.  We 
can  no  longer  assert  the  laws  of  noncontradiction  and  excluded  middle  for 
sure — and  for  free. 

The  fuzzy  theorist  must  explain  why  so  many  people  have  been  wrong  for  so 
long.  We  now  have  the  machinery  to  offer  an  explanation.  The  reason  is  that 
rounding off, quantizing, simplifies life and often costs little. We agree to call empty
the near empty cup, and present the large pulse and absent the small pulse. We
round off points inside the fuzzy cube to the nearest vertex. This roundoff heuristic
works fine as a first approximation to describing the universe until we get near the
midpoint of the cube. These phenomena are harder to round off. In the logically
extreme case, at the midpoint of the cube, the procedure breaks down completely
because every vertex is equally close. Hands are thrown up and paradox declared.

Figure 4 The count M(A) of A is the fuzzy Hamming norm ($l^1$ norm) of the vector drawn from the
origin to A.

Faced  with  midpoint  phenomena,  the  fuzzy  skeptic  is  in  the  same  position  as  the 
flat-earther,  who  denies  that  the  earth’s  surface  is  curved,  when  she  stands  at  the 
north  pole,  looks  at  her  compass  and  wants  to  go  south. 

6.  COUNTING  WITH  FUZZY  SETS 

How big is a fuzzy set? The size or cardinality of A, $M(A)$, is the sum of the fit
values of A:

$$M(A) = \sum_{i=1}^{n} m_A(x_i).$$

The count of A = (1/3 3/4) is $M(A) = 1/3 + 3/4 = 13/12$. The cardinality measure M is
sometimes called the sigma-count.11 The measure M generalizes9 the classical
counting measure of combinatorics and measure theory. (So $(X, I^n, M)$ is the
fundamental measure space of fuzzy theory.) In general the measure M does not
give integer values.

The measure M has a natural geometric interpretation in the sets-as-points
framework. It is the magnitude of the vector drawn from the origin to the fuzzy
set, as illustrated in Figure 4.

Consider the $l^p$ distance between fuzzy sets A and B in $I^n$:

$$l^p(A, B) = \sqrt[p]{\sum_{i=1}^{n} |m_A(x_i) - m_B(x_i)|^p},$$

where $1 \le p \le \infty$. The $l^2$ distance is the physical Euclidean distance actually
illustrated in the figures. The simplest distance is the $l^1$ or fuzzy Hamming distance,
the sum of the absolute fit differences. We shall use fuzzy Hamming distance
throughout, though all results admit a general $l^p$ formulation. Using the fuzzy
Hamming distance the count M can be rewritten as the desired $l^1$ norm:

$$M(A) = \sum_{i=1}^{n} m_A(x_i) = \sum_{i=1}^{n} |m_A(x_i) - 0| = \sum_{i=1}^{n} |m_A(x_i) - m_\emptyset(x_i)| = l^1(A, \emptyset).$$
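The identity between the count and the fuzzy Hamming norm can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than a definitive implementation: the function names `count` and `lp_distance` are hypothetical, only finite p is handled, and the empty set is represented by the origin of the unit square.

```python
from fractions import Fraction
import math

def count(a):
    """Sigma-count M(A): the sum of the fit values."""
    return sum(a)

def lp_distance(a, b, p=1):
    """l^p distance between two fit vectors, for finite p >= 1."""
    total = sum(abs(x - y) ** p for x, y in zip(a, b))
    if p == 1:
        return total  # fuzzy Hamming distance, kept exact
    return float(total) ** (1.0 / p)

A = (Fraction(1, 3), Fraction(3, 4))
EMPTY = (0, 0)  # the empty set sits at the origin of the unit square

hamming_norm = lp_distance(A, EMPTY, p=1)    # equals the count M(A)
euclidean_norm = lp_distance(A, EMPTY, p=2)  # ordinary vector length
```

Here `hamming_norm` and `count(A)` both evaluate to 13/12, while the Euclidean length of the same vector is the smaller value sqrt(97)/12.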


7.  THE  FUZZY  ENTROPY  THEOREM 

How  fuzzy  is  a  fuzzy  set?  Fuzziness  is  measured  by  a  fuzzy  entropy  measure. 
Entropy  is  a  generic  notion.  It  need  not  be  probabilistic.  Entropy  measures  the 
uncertainty  of  a  system  or  message.  A  fuzzy  set  is  a  type  of  system  or  message.  Its 
uncertainty  is  its  fuzziness. 

The fuzzy entropy of A, $E(A)$, varies from 0 to 1 on the unit hypercube $I^n$. Only
the cube vertices have zero entropy, since nonfuzzy sets are unambiguous. The
cube midpoint uniquely has maximum entropy one. Fuzzy entropy smoothly
increases as one moves from any vertex to the midpoint. The algebraic requirements
for fuzzy entropy measures can be found in Klir.8

Simple geometric considerations lead10 to a ratio form for the fuzzy entropy.
The closer the fuzzy set A is to the nearest vertex $A_{near}$, the farther A is from the
farthest vertex $A_{far}$. Opposite the long diagonal from the nearest vertex is the
farthest vertex. Let a denote the distance $l^1(A, A_{near})$ to the nearest vertex and let b
denote the distance $l^1(A, A_{far})$ to the farthest vertex. Then the fuzzy entropy is
simply the ratio of a to b:

$$E(A) = \frac{a}{b} = \frac{l^1(A, A_{near})}{l^1(A, A_{far})}.$$

The sets-as-points interpretation of the fuzzy entropy is shown in Figure 5, where
A = (1/3 3/4), $A_{near}$ = (0 1), and $A_{far}$ = (1 0). So a = 1/3 + 1/4 = 7/12 and
b = 2/3 + 3/4 = 17/12. So E(A) = 7/17.

Alternatively, those reading this in a room can imagine that the room is the unit
cube $I^3$ and that their head is a fuzzy set in it. Once the nearest corner of the
room is located, the farthest corner is opposite the long diagonal emanating from
the nearest corner. If your head is in a corner, a = 0 and E(A) = 0. If your head is
in the metrical center of the room, every corner is nearest and farthest. So a = b
and E(A) = 1.

Figure 5 Fuzzy entropy E(A) = a/b as balance between distance to nearest vertex and distance to
farthest vertex.
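The ratio form of the fuzzy entropy is easy to compute directly. The sketch below is a hedged Python illustration: the names `hamming`, `nearest_vertex`, `farthest_vertex`, and `entropy_ratio` are assumptions introduced here, and a fit value of exactly 1/2 is rounded down to 0 by arbitrary convention, since at that value both roundings are equally near.

```python
from fractions import Fraction

HALF = Fraction(1, 2)

def hamming(a, b):
    """Fuzzy Hamming (l^1) distance between two fit vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))

def nearest_vertex(a):
    """Round each fit value to the closer of 0 and 1 (ties go to 0)."""
    return tuple(1 if x > HALF else 0 for x in a)

def farthest_vertex(a):
    """The vertex opposite the nearest one along the long diagonal."""
    return tuple(1 - v for v in nearest_vertex(a))

def entropy_ratio(a):
    """Fuzzy entropy E(A) = a / b in the ratio form of the text."""
    near = hamming(a, nearest_vertex(a))
    far = hamming(a, farthest_vertex(a))
    return Fraction(near, far)

A = (Fraction(1, 3), Fraction(3, 4))
corner = (Fraction(1), Fraction(0), Fraction(1))   # head in a corner of the room
center = (HALF, HALF, HALF)                        # head at the center of the room
```

With these inputs `entropy_ratio(A)` reproduces the value 7/17 worked out above, the corner gives 0, and the center of the room gives 1.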

Since  overlap  and  underlap  characterize  fuzziness  we  can  expect  them  to  be 
involved  in  the  measure  of  fuzziness.  Careful  examination  of  the  completed  fuzzy 
square  in  Figure  3  shows  that  this  is  the  case.  For,  by  symmetry,  each  of  the  four 
points $A$, $A^c$, $A \cap A^c$, and $A \cup A^c$ is equally close to its nearest vertex. The common
distance is a. Similarly, each point is equally far from its farthest vertex. The
common distance is b. One of the first four distances is the count $M(A \cap A^c)$. One
of the second four distances is the count $M(A \cup A^c)$. This gives a geometric proof
of the Fuzzy Entropy Theorem,10,11 which states that fuzziness consists of a
balance  of  counted  violations  of  the  law  of  noncontradiction  and  counted 
violations  of  the  law  of  excluded  middle. 

Fuzzy  Entropy  Theorem 


$$E(A) = \frac{M(A \cap A^c)}{M(A \cup A^c)}.$$


An  algebraic  proof  is  straightforward.  The  geometric  proof  can  be  seen  by 
examining  the  completed  fuzzy  square  in  Figure  6. 

The Fuzzy Entropy Theorem explains why fuzziness begins where Western logic
ends. When the laws of noncontradiction and excluded middle are obeyed, overlap
is empty and underlap is exhaustive. So $M(A \cap A^c) = 0$ and $M(A \cup A^c) = n$, and thus
E(A) = 0.

Figure 6 Geometry of the Fuzzy Entropy Theorem. By symmetry each of the four points on the
completed fuzzy square is equally close to its nearest vertex and equally far from its farthest vertex.
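The theorem can also be spot-checked by brute force. The following Python sketch is only a numerical sanity check on a coarse grid, not a proof: it compares the overlap/underlap ratio with the distance ratio obtained by searching all vertices of the cube. The function names and the choice of a twelfths grid in the 3-cube are assumptions made for illustration.

```python
from fractions import Fraction
from itertools import product

def overlap_underlap_entropy(a):
    """E(A) as the count of A-and-not-A over the count of A-or-not-A."""
    overlap = sum(min(x, 1 - x) for x in a)
    underlap = sum(max(x, 1 - x) for x in a)
    return Fraction(overlap, underlap)

def distance_ratio_entropy(a):
    """E(A) as nearest-vertex distance over farthest-vertex distance,
    found by exhaustive search over all 2^n vertices."""
    dists = [
        sum(abs(x - v) for x, v in zip(a, vertex))
        for vertex in product((0, 1), repeat=len(a))
    ]
    return Fraction(min(dists), max(dists))

# Every point of the 3-cube whose coordinates are multiples of 1/12.
grid = [Fraction(k, 12) for k in range(13)]
agree = all(
    overlap_underlap_entropy(a) == distance_ratio_entropy(a)
    for a in product(grid, repeat=3)
)
```

On this grid the two expressions agree at every point, including the vertices (entropy 0) and the midpoint (entropy 1).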


The  Fuzzy  Entropy  Theorem  also  provides  a  first-principles  derivation  of  the 
basic  fuzzy  set  operations  of  minimum  (intersection),  maximum  (union),  and  order 
reversal  (complementation)  proposed  in  1965  by  Zadeh16  at  the  inception  of  fuzzy 
theory.  (Lukasiewicz  first  proposed  these  operations  for  continuous  or  fuzzy  logics 
in  the  1920s.) 

For  the  fuzzy  theorist,  this  result  also  shows  that  triangular  norms  or  T-norms,8 
which  generalize  conjunction  or  intersection,  and  the  dual  triangular  co-norms  C, 
which  generalize  disjunction  or  union,  do  not  have  the  first-principles  status  of 
min and max. For, the triangular norm inequalities,

$$T(x, y) \le \min(x, y) \le \max(x, y) \le C(x, y),$$

show that replacing min with any T in the numerator term $M(A \cap A^c)$ can only
make the numerator smaller. Replacing max with any C in the term $M(A \cup A^c)$
can only make the denominator larger. So any T or C not identically min or max
makes the ratio smaller, strictly smaller if A is fuzzy. Then the entropy theorem
does not hold and the resulting pseudo-entropy measure does not equal unity at
the midpoint, though it continues to be maximized there. This can be easily seen
with the product T-norm14 $T(x, y) = xy$ and its DeMorgan dual co-norm
$C(x, y) = 1 - T(1 - x, 1 - y) = x + y - xy$, or with the bounded sum T-norm
$T(x, y) = \max(0, x + y - 1)$ and DeMorgan dual $C(x, y) = \min(1, x + y)$. The Entropy Theorem
similarly fails in general if the negation or complementation operator $N(x) = 1 - x$
is replaced by a parameterized operator $N_a(x) = (1 - x)/(1 + ax)$ for nonzero $a > -1$.
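The effect of swapping min and max for another T-norm and co-norm pair can be seen numerically. The Python sketch below is illustrative only: `pseudo_entropy` and the three operator pairs are names assumed here, and the ratio simply substitutes T for min in the overlap count and C for max in the underlap count, with the standard complement 1 - x left unchanged.

```python
from fractions import Fraction

def pseudo_entropy(a, t_norm, co_norm):
    """Ratio of a T-norm 'overlap' count to a co-norm 'underlap' count."""
    numerator = sum(t_norm(x, 1 - x) for x in a)
    denominator = sum(co_norm(x, 1 - x) for x in a)
    return Fraction(numerator, denominator)

# (T, C) pairs named in the text.
min_max = (min, max)
product_pair = (lambda x, y: x * y, lambda x, y: x + y - x * y)
bounded_pair = (lambda x, y: max(0, x + y - 1), lambda x, y: min(1, x + y))

midpoint = (Fraction(1, 2), Fraction(1, 2))
A = (Fraction(1, 3), Fraction(3, 4))

at_mid_minmax = pseudo_entropy(midpoint, *min_max)
at_mid_product = pseudo_entropy(midpoint, *product_pair)
at_mid_bounded = pseudo_entropy(midpoint, *bounded_pair)
```

At the midpoint the min-max pair gives 1, the product pair gives 1/3, and the bounded pair gives 0, so only min and max reach unity there.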

As an aside, note that all probability distributions, all sets A with $M(A) = 1$, in $I^n$
form an (n-1)-dimensional simplex $S^n$. In the unit square the probability simplex is
the negatively sloped diagonal line. In the unit 3-cube it is a solid triangle. In the
unit 4-cube it is a tetrahedron, and so on up.

If no probabilistic fit value $p_i$ is such that $p_i > 1/2$, then the Fuzzy Entropy
Theorem implies11 that the distribution P has fuzzy entropy $E(P) = 1/(n - 1)$. Else
$E(P) < 1/(n - 1)$. So the probability simplex $S^n$ is entropically degenerate for large
dimensions n. This result also shows that the uniform distribution $(1/n, \ldots, 1/n)$
maximizes fuzzy entropy on $S^n$ but not uniquely. This in turn shows that fuzzy
entropy  differs  from  the  average-information  measure  of  probabilistic  entropy, 
which  is  uniquely  maximized  by  the  uniform  distribution. 
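The degeneracy of fuzzy entropy on the probability simplex can be illustrated with three distributions on four elements. This is a small Python sketch under assumed names (`fuzzy_entropy`, `uniform`, `skewed`, `peaked`); the particular distributions are chosen here for illustration and do not come from the text.

```python
from fractions import Fraction

def fuzzy_entropy(a):
    """E(A) = M(A and not-A) / M(A or not-A) with min, max, and 1 - x."""
    overlap = sum(min(x, 1 - x) for x in a)
    underlap = sum(max(x, 1 - x) for x in a)
    return Fraction(overlap, underlap)

n = 4
uniform = tuple(Fraction(1, n) for _ in range(n))
# No entry exceeds 1/2, yet the distribution is far from uniform.
skewed = (Fraction(1, 2), Fraction(1, 4), Fraction(1, 8), Fraction(1, 8))
# One entry exceeds 1/2.
peaked = (Fraction(3, 4), Fraction(1, 8), Fraction(1, 16), Fraction(1, 16))

bound = Fraction(1, n - 1)
```

Both `uniform` and `skewed` attain the bound 1/(n - 1) = 1/3, which shows the non-uniqueness of the maximizer, while `peaked` falls strictly below it at 1/7.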

The Fuzzy Entropy Theorem also implies that, analogous to $\log 1/p$, a unit of
fuzzy information is $f/(1 - f)$ or $(1 - f)/f$ depending on whether the fit value f
obeys $f \le 1/2$ or $f \ge 1/2$.

The event x can be ambiguous or clear. It is ambiguous if f is approximately 1/2
and clear if f is approximately 1 or 0. If an ambiguous event occurs, is observed,
is disambiguated, etc., then it is maximally informative: $E(f) = E(1/2) = 1$. If a clear
event occurs, is observed, etc., it is minimally informative: $E(f) = E(0) = E(1) = 0$.
This  is  in  accord  with  the  information  interpretation  of  the  probabilistic  entropy 
measure  log  1/p,  where  the  occurrence  of  a  sure  event  (p=l)  is  minimally 
informative  (zero  entropy)  and  the  occurrence  of  an  impossible  event  (p  =  0)  is 
maximally  informative  (infinite  entropy). 


8.  THE  SUBSETHOOD  THEOREM 

Sets contain subsets. A is a subset of B, denoted $A \subset B$, if and only if every element
of A is an element of B. The power set $2^B$ contains all of B's subsets. So,
alternatively,1 A is a subset of B just in case A belongs to B's power set:

$A \subset B$ if and only if $A \in 2^B$.

The  subset  relation  corresponds  to  the  implication  relation  in  logic.  In  classical 
logic  truth  is  a  mapping  from  the  set  of  statements  {S}  to  truth  values: 
$t: \{S\} \to \{0, 1\}$. Consider the truth-tabular definition of implication for bivalent
propositions P and Q:

P   Q   P → Q
0   0     1
0   1     1
1   0     0
1   1     1

The  implication  is  false  if  and  only  if  the  antecedent  P  is  true  and  the  consequent 
Q  is  false — when  “truth  implies  falsehood”. 

The same holds for subsets. Representing sets as bivalent functions $m_A: X \to \{0, 1\}$,
A is a subset of B if there is no element x that belongs to A but not to B, that is,
no x with $m_A(x) = 1$ but $m_B(x) = 0$. This membership-function definition can be
rewritten as follows:

$A \subset B$ if and only if $m_A(x) \le m_B(x)$ for all x.

Figure 7 Fuzzy power set $F(2^B)$ as a hyper-rectangle in the fuzzy cube. Side lengths are the fit values
$m_B(x_i)$. The size or volume of $F(2^B)$ is the product of the fit values.


Zadeh16  proposed  the  same  relation  for  defining  when  fuzzy  set  A  is  a  subset  of 
fuzzy  set  B.  We  refer  to  this  as  the  dominated  membership  function  relationship.  If 
A = (0.3 0 0.7) and B = (0.4 0.7 0.9), then A is a fuzzy subset of B but B is not a fuzzy
subset  of  A.  A  candidate  fuzzy  set  A  either  is  or  is  not  a  fuzzy  subset  of  B.  This  is 
the  problem.  The  relation  of  fuzzy  subsethood  is  not  fuzzy.  It  is  either  black  or 
white. 
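The all-or-none character of the dominated membership function relationship shows up clearly in code. The Python sketch below uses a hypothetical function name, `is_fuzzy_subset`, and adds one perturbed vector, `A_near_miss`, which is an assumption introduced here to show how a single small violation flips the answer.

```python
def is_fuzzy_subset(a, b):
    """Dominated membership test: every fit value of a is at most that of b."""
    return all(x <= y for x, y in zip(a, b))

A = (0.3, 0.0, 0.7)
B = (0.4, 0.7, 0.9)

# Same as A except that the first fit value exceeds B's by 0.01.
A_near_miss = (0.41, 0.0, 0.7)

a_in_b = is_fuzzy_subset(A, B)                    # True
b_in_a = is_fuzzy_subset(B, A)                    # False
near_miss_in_b = is_fuzzy_subset(A_near_miss, B)  # False, despite the tiny violation
```

The test returns only True or False, so `A_near_miss` is rejected just as firmly as B is rejected as a subset of A.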

The sets-as-points view asks a geometric question: What do all fuzzy subsets of
B look like? What does the fuzzy power set of B, $F(2^B)$, the set of all fuzzy subsets
of B, look like? The dominated membership function relationship implies that
$F(2^B)$ is the hyper-rectangle emanating from the origin with side lengths given by
the fit values $m_B(x_i)$. Figure 7 displays the fuzzy power set of the set B = (1/3 3/4). Of
course the count of $F(2^B)$ is infinite if B is not empty. For finite-dimensional sets,
the size of $F(2^B)$ can be taken11 as the Lebesgue measure or volume $V(B)$, the
product of the fit values:

$$V(B) = \prod_{i=1}^{n} m_B(x_i).$$
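The volume of the hyper-rectangle is a one-line computation. In the Python sketch below the function name `volume` is hypothetical, and the fit vector B = (1/3 3/4) is assumed for illustration, borrowing the values of the essay's running two-element example.

```python
from fractions import Fraction
from math import prod

def volume(b):
    """Volume V(B) of the hyper-rectangle F(2^B): the product of the fit values."""
    return prod(b)

B = (Fraction(1, 3), Fraction(3, 4))   # assumed illustrative fit vector
whole_cube = (1, 1)                    # B = X: every fuzzy set is a subset
corner = (1, 0)                        # a nonfuzzy set missing one element

v_b = volume(B)
v_cube = volume(whole_cube)
v_corner = volume(corner)
```

Here the rectangle for B occupies one quarter of the unit square, the set X gives the whole square, and any fit value of zero collapses the rectangle to zero volume.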

Figure 7 illustrates that $F(2^B)$ is not a fuzzy set. A cube point A either is or is
not in the hyper-rectangle $F(2^B)$. Some points A outside the hyper-rectangle $F(2^B)$
resemble subsets of B more than other points do. The black-white definition of
subsethood ignores this.

The natural generalization is to define fuzzy subsets on $F(2^B)$. Some sets A
belong in $F(2^B)$ to different degrees. The abstract membership function $m_{F(2^B)}(A)$
can be any number in [0, 1]. Degrees of subsethood are possible.

Let S(A, B) denote the degree to which A is a subset of B:

$$S(A, B) = \mathrm{Degree}(A \subset B) = m_{F(2^B)}(A).$$


$S(\cdot, \cdot)$ is the subsethood measure. $S(\cdot, \cdot)$ takes values in [0, 1]. We will
see that it is the fundamental, unifying structure in fuzzy theory.

The current task is to measure S(A, B). We will first present an earlier [10, 11]
algebraic derivation of the subsethood measure S(A, B). We will then present a
new, more fundamental, geometric derivation.

The algebraic derivation is the fit-violation strategy [10]. The idea is that you
study a law by breaking it. Consider the dominated membership function
relationship: $A \subset B$ if and only if $m_A(x) \le m_B(x)$ for all x in X.

Suppose element $x_v$ violates the dominated membership function relationship:

$$m_A(x_v) > m_B(x_v).$$

Then A is not a subset of B, at least not totally. Suppose further
that the dominated membership inequality holds for all other elements x. Only
element $x_v$ violates the relationship. For instance, X may consist of one hundred
values: $X = \{x_1, \ldots, x_{100}\}$. The violation might occur, say, with the first element:
$x_1 = x_v$. Then intuitively A is largely a subset of B. Suppose now that X contains a
thousand elements, or a trillion elements, and only the first element violates the
dominated membership function relationship. Then surely A is overwhelmingly a
subset of B; perhaps S(A, B) = 0.999999999999.

This argument suggests we should count fit violations in magnitude and
frequency. The greater the violations in magnitude, $m_A(x_v) - m_B(x_v)$, and the
greater the number of violations relative to the size M(A) of A, the less A is a
subset of B; equivalently, the more A is a superset of B. For, both intuitively and
by the dominated-membership definition, supersethood and subsethood are inversely
related:

$$\mathrm{SUPERSETHOOD}(A, B) = 1 - S(A, B).$$

The simplest way to count violations is to add them. If we sum over all x, the
summand should equal $m_A(x_v) - m_B(x_v)$ when this difference is positive, zero when
it is nonpositive. So the summand is $\max(0, m_A(x) - m_B(x))$. The unnormalized
count is therefore the sum of these maxima:

$$\sum_{x \in X} \max(0, m_A(x) - m_B(x)).$$


The simplest, and most appropriate, normalization factor is the count of A, M(A).
We can assume M(A) > 0 since M(A) = 0 if and only if A is empty. The empty set
trivially satisfies the dominated membership function relationship. So it is a subset
of every set. Normalization gives the minimal measure of nonsubsethood, of
supersethood:

$$\mathrm{SUPERSETHOOD}(A, B) = \frac{\sum_{x \in X} \max(0, m_A(x) - m_B(x))}{M(A)}.$$


Then subsethood is the negation of this ratio. This gives the minimal fit-violation
measure of subsethood:

$$S(A, B) = 1 - \frac{\sum_{x \in X} \max(0, m_A(x) - m_B(x))}{M(A)}.$$

The subsethood measure may appear ungraceful at first but it behaves as it
should. Observe that S(A, B) = 1 if and only if the dominated membership function
relationship holds. For if it holds, zero violations are summed. Then
S(A, B) = 1 - 0 = 1. If S(A, B) = 1, every numerator summand is zero. So no violation
occurs. At the other extreme, S(A, B) = 0 if and only if B is the empty set. So the
empty set is the unique set without subsets, fuzzy or nonfuzzy. Degrees of
subsethood occur between these extremes.

The subsethood measure also relates to logical implication. Viewed at the
1-dimensional level of fuzzy logic, and so ignoring the normalizing count (M(A) = 1),
the subsethood measure reduces to the Lukasiewicz implication operator:

$$S(A, B) = 1 - \max(0, m_A - m_B)$$

$$= 1 - [1 - \min(1 - 0,\ 1 - (m_A - m_B))]$$

$$= \min(1,\ 1 - m_A + m_B),$$


which  clearly  generalizes  the  truth-tabular  definition  of  bivalent  implication. 
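The reduction can be checked numerically. The short Python sketch below is only an illustration (the function names are ours, not the paper's): it compares the two forms on a grid of fit values and then evaluates the operator on the four bivalent truth-table rows.

```python
def subsethood_1d(a, b):
    """One-dimensional subsethood with the count ignored: 1 - max(0, a - b)."""
    return 1.0 - max(0.0, a - b)


def lukasiewicz(a, b):
    """Lukasiewicz implication operator: min(1, 1 - a + b)."""
    return min(1.0, 1.0 - a + b)


# Compare the two forms on a grid of fit values in [0, 1].
grid = [i / 10 for i in range(11)]
max_gap = max(abs(subsethood_1d(a, b) - lukasiewicz(a, b))
              for a in grid for b in grid)

# Evaluate the operator on bivalent (0/1) truth values.
truth_table = {(a, b): lukasiewicz(a, b) for a in (0, 1) for b in (0, 1)}
```

On bivalent inputs the table has the value 0 only for the row (1, 0), which is the classical implication table.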

Consider the fit vectors A = (0.2 0 0.4 0.5) and B = (0.7 0.6 0.3 0.7). Neither set is a
proper subset of the other. A is almost a subset of B but not quite since
$m_A(x_3) - m_B(x_3) = 0.4 - 0.3 = 0.1 > 0$. Hence $S(A, B) = 1 - \frac{0.1}{1.1} = \frac{10}{11}$.
Similarly $S(B, A) = 1 - \frac{1.3}{2.3} = \frac{10}{23}$.
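The fit-violation measure is simple to compute. The following Python sketch is a minimal illustration, not the authors' code: the helper names `count` and `subsethood` are ours, and exact fractions are used only so that the results can be compared with the values worked out for A = (0.2 0 0.4 0.5) and B = (0.7 0.6 0.3 0.7).

```python
from fractions import Fraction


def count(fits):
    """Count M(A): the sum of the fit values."""
    return sum(fits)


def subsethood(a, b):
    """Fit-violation measure S(A, B) = 1 - sum(max(0, a_i - b_i)) / M(A)."""
    if count(a) == 0:
        return Fraction(1)  # the empty set is a subset of every set
    violations = sum(max(Fraction(0), ai - bi) for ai, bi in zip(a, b))
    return 1 - violations / count(a)


# Fit vectors from the worked example, written as exact fractions.
A = [Fraction(2, 10), Fraction(0), Fraction(4, 10), Fraction(5, 10)]
B = [Fraction(7, 10), Fraction(6, 10), Fraction(3, 10), Fraction(7, 10)]

s_ab = subsethood(A, B)  # 10/11
s_ba = subsethood(B, A)  # 10/23
```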

The concept of subsethood applies to nonfuzzy sets. Consider the sets

$$C = \{x_1, x_2, x_3, x_5, x_7, x_9, x_{10}, x_{12}, x_{14}\}$$

and

$$D = \{x_2, x_3, x_4, x_5, x_6, x_7, x_8, x_9, x_{10}, x_{12}, x_{13}, x_{14}\}$$

with corresponding bit vectors

C = (1 1 1 0 1 0 1 0 1 1 0 1 0 1)

D = (0 1 1 1 1 1 1 1 1 1 0 1 1 1)

C and D are not subsets of each other. But C should very nearly be a subset of D
since only $x_1$ violates the dominated membership function relationship. We find
$S(C, D) = 1 - \frac{1}{9} = \frac{8}{9}$ while $S(D, C) = 1 - \frac{4}{12} = \frac{2}{3}$. So D is more a
subset of C than it is not. This is because the two sets are largely equivalent. They
have much overlap: $M(C \cap D) = 8$. This observation motivates the Fuzzy Subsethood
Theorem presented below. First, though, we present a new geometric derivation of the
subsethood measure.
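For nonfuzzy sets each violation has unit magnitude, so the measure reduces to counting the elements of one set that are missing from the other. The Python sketch below assumes the index sets read off the two bit vectors in the text; the helper names are illustrative only.

```python
from fractions import Fraction

# Index sets read off the bit vectors C and D (element x_i is stored as i).
C = {1, 2, 3, 5, 7, 9, 10, 12, 14}
D = {2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 13, 14}


def crisp_subsethood(a, b):
    """1 - |a - b| / |a|: each element of a missing from b is one unit violation."""
    return 1 - Fraction(len(a - b), len(a))


def bit_vector(s, n=14):
    """Bit-vector form of an index set over x_1, ..., x_n."""
    return [1 if i in s else 0 for i in range(1, n + 1)]


s_cd = crisp_subsethood(C, D)  # 8/9
s_dc = crisp_subsethood(D, C)  # 2/3
overlap = len(C & D)           # 8
```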

Consider the sets-as-points geometry of subsethood in Figure 7. Set A is either
in the hyper-rectangle $F(2^B)$ or not. Intuitively the subsethood of A in B should be
nearly unity when A is arbitrarily close to the fuzzy power set $F(2^B)$. The farther
away, the less the subsethood S(A, B) or, equivalently, the greater the
supersethood.

So the key idea is metrical: How close is A to $F(2^B)$? Let $d(A, F(2^B))$ denote this
$\ell^p$ distance. There is a distance $d(A, B')$ between A and every point $B'$ in the
hyper-rectangle, every subset $B'$ of B. The distance $d(A, F(2^B))$ is the smallest such
distance. Since the hyper-rectangle $F(2^B)$ is geometrically well behaved ($F(2^B)$ is
closed and bounded (compact) and convex), some subset $B^*$ of B achieves this
minimum distance. So the infimum, the greatest lower bound, is the distance
$d(A, B^*)$:

$$d(A, F(2^B)) = \inf \{ d(A, B') : B' \in F(2^B) \} = d(A, B^*).$$


The closest set $B^*$ is easy to locate geometrically. In the Euclidean or $\ell^2$ case,
this is formally due to the geometric Hahn-Banach Theorem since $F(2^B)$ is convex.
If A is a subset of B, that is, if A is in the hyper-rectangle $F(2^B)$, then A itself is the
closest subset: $A = B^*$. So suppose A is not a proper subset of B.

The unit cube $I^n$ can be sliced into $2^n$ hyper-rectangles by extending the sides of
$F(2^B)$ to hyperplanes. The hyperplanes intersect perpendicularly (orthogonally), at
least in the Euclidean case. $F(2^B)$ is one of the hyper-rectangles. The hyper-rectangle
interiors correspond to the $2^n$ cases whether $m_A(x_i) < m_B(x_i)$ or
$m_A(x_i) > m_B(x_i)$ for fixed B and arbitrary A. The edges are the loci of points when some
$m_A(x_i) = m_B(x_i)$.
The $2^n$ hyper-rectangles can be classified as mixed or pure membership
domination. In the pure case either $m_A < m_B$ or $m_A > m_B$ holds in the hyper-rectangle
interior for all x and all interior points A. In the mixed case
$m_A(x_i) < m_B(x_i)$ holds for some of the coordinates $x_i$ and $m_A(x_j) > m_B(x_j)$ holds for
the remaining coordinates $x_j$ in the interior for all interior A. So there are only
two pure membership-domination hyper-rectangles, the set of proper subsets $F(2^B)$
and the set of proper supersets.

Figure 8 illustrates how the fuzzy power set $F(2^B)$ of $B = (\tfrac{1}{3}\ \tfrac{3}{4})$ can be
linearly extended to partition the unit square into $2^2 = 4$ rectangles. The non-subsets
$A_1$, $A_2$ and $A_3$ reside in distinct quadrants. The northwest and southeast quadrants
are the mixed membership-domination rectangles. The southwest and the northeast
quadrants are the pure rectangles.

The nearest set $B^*$ to A in the pure superset hyper-rectangle is B itself. The
nearest set $B^*$ in the mixed case is found by drawing a perpendicular (orthogonal)
line segment from A to $F(2^B)$. Convexity of $F(2^B)$ is responsible. In Figure 8 the
perpendicular lines from $A_1$ and $A_3$ intersect line edges (1-dimensional linear
subspaces) of the rectangle $F(2^B)$. The line from $A_2$ to B, the corner of $F(2^B)$, is
degenerately perpendicular since B is a zero-dimensional linear subspace.

These “orthogonality” conditions are more pronounced in three dimensions. Let
your room again be the unit 3-cube. Consider a large dictionary fit snugly against
the floor corner corresponding to the origin. Point B is the dictionary corner
farthest from the origin. Extending the three exposed faces of the dictionary
partitions the room into 8 octants, one of which is occupied by the dictionary.
Points in the other 7 octants are connected to the nearest points on the dictionary
by lines that perpendicularly intersect one of the three exposed faces, one of the
three exposed edges, or the corner B.

Figure 8 Partition of hypercube $I^n$ into $2^n$ hyper-rectangles by linearly extending $F(2^B)$. The nearest
points $B_1^*$ and $B_3^*$ to points $A_1$ and $A_3$ in the northwest and southeast quadrants are found by the
normals from $F(2^B)$ to $A_1$ and $A_3$. The nearest point $B^*$ to point $A_2$ in the northeast quadrant is B itself.
This “orthogonal” optimality condition allows d(A, B) to be given by the general Pythagorean Theorem
as the hypotenuse in an $\ell^p$ “right” triangle.

The “orthogonality” condition invokes the $\ell^p$-version of the Pythagorean
Theorem. For our $\ell^1$ purposes:

$$d(A, B) = d(A, B^*) + d(B, B^*).$$

The more familiar $\ell^2$-version, actually pictured in Figure 8, requires squaring these
distances. For the general $\ell^p$ case:

$$d(A, B)^p = d(A, B^*)^p + d(B^*, B)^p,$$

or equivalently,

$$\sum_{i=1}^{n} |a_i - b_i|^p = \sum_{i=1}^{n} |a_i - b_i^*|^p + \sum_{i=1}^{n} |b_i^* - b_i|^p.$$

Equality holds for all $p \ge 1$ since, as is clear from Figure 8 and in general, from the
algebraic argument below, either $b_i^* = a_i$ or $b_i^* = b_i$.
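The $\ell^p$ Pythagorean equality can be checked numerically. The Python sketch below takes the nearest point $B^*$ to be the componentwise minimum of A and B, the characterization the text derives later; the fit vectors and the exponents tried are hypothetical values chosen only for illustration.

```python
def lp_power(u, v, p):
    """Sum of |u_i - v_i|**p: the l^p distance raised to the p-th power."""
    return sum(abs(ui - vi) ** p for ui, vi in zip(u, v))


# Hypothetical fit vectors; A violates B in the first and third coordinates.
A = [0.9, 0.2, 0.6]
B = [0.4, 0.7, 0.5]

# Nearest point of F(2^B) to A, assumed here to be the componentwise minimum.
B_star = [min(a, b) for a, b in zip(A, B)]

# Gap between the two sides of the equality for several exponents p >= 1.
gaps = {
    p: lp_power(A, B, p) - (lp_power(A, B_star, p) + lp_power(B_star, B, p))
    for p in (1, 2, 3.5)
}
```

Each coordinate contributes its whole gap to exactly one of the two right-hand sums, which is why the gaps vanish (up to floating-point rounding) for every exponent.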

This Pythagorean equality is surprising. We have come to think of the
Pythagorean Theorem (and orthogonality) as an $\ell^2$ or Hilbert space property. Yet
here it holds in every $\ell^p$ space, provided $B^*$ is the set in $F(2^B)$ closest to A in $\ell^p$
distance. Of course for other sets strict inequality holds in general if $p \ne 2$. This
suggests a special status for the closest set $B^*$. We shall see below that the Subsethood
Theorem confirms this suggestion. We shall use the term “orthogonality” loosely
to refer to this $\ell^p$ Pythagorean relationship, while remembering its customary
restriction to $\ell^2$ spaces and inner products.

Figure 9 Dependence of subsethood on the count M(A). $A_1$ and $A_2$ are equidistant to $F(2^B)$ but $A_1$ is
closer to B than $A_2$ is; correspondingly, $M(A_1) > M(A_2)$. Loci of points A of constant count M(A) are
line segments parallel to the negatively sloping long diagonal. $\ell^1$ spheres centered at B are diamond
shaped.

The natural suggestion is to define supersethood as the distance
$d(A, F(2^B)) = d(A, B^*)$. Supersethood increases with this distance, subsethood decreases
with it. To keep supersethood, and thus subsethood, unit-interval valued, the distance
must be suitably normalized.

The simplest way to normalize $d(A, B^*)$ is with a constant: the maximum unit-cube
distance, $n^{1/p}$ in the general $\ell^p$ case and n in our $\ell^1$ case. This gives the
candidate subsethood measure

$$S(A, B) = 1 - \frac{d(A, B^*)}{n}.$$


This candidate subsethood measure fails in the boundary case when B is the
empty set. For then $d(A, B^*) = d(A, B) = M(A)$. So the measure gives
$S(A, \emptyset) = 1 - M(A)/n \ge 0$. Equality holds exactly when A = X. But the empty set has no
subsets. The only normalization factor that ensures this is the count M(A). Of
course M(A) = n when A = X.
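The boundary failure is easy to reproduce. The Python sketch below is illustrative only: the fit vector A is a hypothetical example, the function names are ours, and the nearest subset is taken to be the componentwise minimum, which reduces to the empty set when B is empty.

```python
def l1_distance(u, v):
    """l^1 distance between two fit vectors."""
    return sum(abs(x - y) for x, y in zip(u, v))


def nearest_subset(a, b):
    """Componentwise minimum, assumed here as the point of F(2^B) nearest A."""
    return [min(x, y) for x, y in zip(a, b)]


def candidate_by_n(a, b):
    """Candidate measure normalized by the dimension n."""
    return 1 - l1_distance(a, nearest_subset(a, b)) / len(a)


def measure_by_count(a, b):
    """Measure normalized by the count M(A); assumes A is nonempty."""
    return 1 - l1_distance(a, nearest_subset(a, b)) / sum(a)


A = [0.25, 0.5]      # hypothetical nonempty fuzzy set
X = [1.0, 1.0]       # the whole space
EMPTY = [0.0, 0.0]   # the empty set

by_n = candidate_by_n(A, EMPTY)          # 1 - 0.75/2 = 0.625, wrongly positive
by_count = measure_by_count(A, EMPTY)    # 1 - 0.75/0.75 = 0.0
by_n_whole = candidate_by_n(X, EMPTY)    # 0.0, the only case where n works
```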

Normalizing by n also treats all equidistant points the same. Consider points $A_1$
and $A_2$ in Figure 9. Both points are equidistant to their nearest $F(2^B)$ point:
$d(A_1, B_1^*) = d(A_2, B_2^*)$. But $A_1$ is closer to B than $A_2$. In particular $A_1$ is closer to
the horizontal line defined by the fit value $m_B(x_2) = \tfrac{3}{4}$. The variable quantity that
reflects this is the count M(A): $M(A_1) > M(A_2)$. The count gap $M(A_1) - M(A_2)$ is
due to the fit gap involving $x_1$, and reflects $d(A_1, B) < d(A_2, B)$. In general the count
M(A) relates to this distance, as can be seen by checking extreme cases of closeness
of A to B (and drawing some diamond-shaped $\ell^1$ spheres centered at B). Indeed if
$m_A > m_B$ everywhere, d(A, B) = M(A) - M(B).

Since $F(2^B)$ fits snugly against the origin, the count M(A) in any of the other
$2^n - 1$ hyper-rectangles can only be larger than the count $M(B^*)$ of the nearest
$F(2^B)$ points. The normalization choice of n leaves the candidate subsethood
measure indifferent to which of the $2^n - 1$ hyper-rectangles A is in and to where A
is in the hyper-rectangle. Each point in each hyper-rectangle involves a different
combination of fit violations and satisfactions. The normalization choice of M(A)
reflects this fit violation structure as well as behaves appropriately in boundary
cases.

The normalization choice M(A) leads to the subsethood measure

$$S(A, B) = 1 - \frac{d(A, B^*)}{M(A)}.$$

We now show that this measure is equal to the subsethood measure derived
algebraically above.

Let $B'$ be any subset of B. Then by definition the nearest subset $B^*$ obeys the
inequality:

$$\sum_{i=1}^{n} |a_i - b_i^*|^p \le \sum_{i=1}^{n} |a_i - b_i'|^p,$$

where for convenience $a_i = m_A(x_i)$ and similarly for the $b_i$ fit values. We will assume
p = 1 but the following characterization of $b_i^*$ is valid for any p > 1.

By orthogonality we know that $a_i$ is at least as big as $b_i^*$. So first suppose
$a_i = b_i^*$. This occurs if and only if no violation occurs: $a_i \le b_i$. (If this holds for all i,
then $A = B^*$.) So $\max(0, a_i - b_i) = 0$. Next suppose $a_i > b_i^*$. This occurs if and only if
a violation occurs: $a_i > b_i$. (If this holds for all i, then $B = B^*$.) So $b_i^* = b_i$ since $B^*$ is
the subset of B nearest to A. Equivalently, $a_i > b_i$ holds if and only if
$\max(0, a_i - b_i) = a_i - b_i$. So the two cases together prove that
$\max(0, a_i - b_i) = |a_i - b_i^*|$. Summing over all $x_i$ gives:

$$d(A, B^*) = \sum_{i=1}^{n} \max(0, m_A(x_i) - m_B(x_i)).$$

So the two subsethood measures are equivalent.
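The equivalence can also be checked by brute force in two dimensions. The Python sketch below uses hypothetical fit vectors and an arbitrary grid resolution: it scans grid points of the hyper-rectangle $F(2^B)$, finds the smallest $\ell^1$ distance to A, and compares it with the sum of fit violations and with the distance to the componentwise minimum.

```python
from itertools import product


def l1(u, v):
    """l^1 distance between two fit vectors."""
    return sum(abs(x - y) for x, y in zip(u, v))


# Hypothetical fit vectors; only the first coordinate violates a_i <= b_i.
A = [0.8, 0.3]
B = [0.5, 0.6]

# Candidate nearest subset: the componentwise minimum of A and B.
B_star = [min(a, b) for a, b in zip(A, B)]

# Algebraic measure: the sum of the fit violations.
violation_sum = sum(max(0.0, a - b) for a, b in zip(A, B))

# Brute force: scan a grid of points B' in F(2^B) = [0, b_1] x [0, b_2].
steps = 50
axes = [[b * k / steps for k in range(steps + 1)] for b in B]
brute_force = min(l1(A, point) for point in product(*axes))
```

The grid search is only a sanity check; it cannot find a point closer than the true minimum, and here the grid happens to contain $B^*$.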

This proof also proves a deeper characterization of the optimal subset
$B^*$: $B^* = A \cap B$. For if a violation occurs, $a_i > b_i$ and $b_i = b_i^*$. So $\min(a_i, b_i) = b_i^*$.
Otherwise $a_i = b_i^*$, and so $\min(a_i, b_i) = b_i^*$.

This in turn proves that $B^*$ is a point of double optimality. Not only is $B^*$ the
subset of B nearest A, $B^*$ is also $A^*$, the subset of A nearest to B:
$d(B, F(2^A)) = d(B, A^*) = d(B, B^*)$. Figure 10 illustrates that $B^* = A \cap B = A^*$ is the set
within both the hyper-rectangle $F(2^A)$ and the hyper-rectangle $F(2^B)$ that has maximal
count $M(A \cap B)$.

Figure 10 $B^*$ as both the subset of B nearest A and the subset $A^*$ of A nearest B: $B^* = A^* = A \cap B$.
The distance $d(A, B^*) = M(A) - M(A \cap B)$ illustrates the Subsethood Theorem.

Figure 10 also shows that the distance $d(A, B^*)$ is a vector magnitude difference:
$d(A, B^*) = M(A) - M(A \cap B)$. Dividing both sides of this equality by M(A) and
rearranging proves a surprising and still deeper structural characterization of
subsethood, the Subsethood Theorem.

Subsethood Theorem

$$S(A, B) = \frac{M(A \cap B)}{M(A)}.$$
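A quick numerical check of the theorem compares the ratio form with the fit-violation form. The Python sketch below is an illustration under our own assumptions: the function names are ours, the fit vectors are drawn at random on a grid of tenths, and the first fit value is kept positive so that M(A) > 0.

```python
import random
from fractions import Fraction


def count(a):
    """Count M(A): the sum of the fit values."""
    return sum(a)


def s_fit_violation(a, b):
    """S(A, B) = 1 - sum(max(0, a_i - b_i)) / M(A)."""
    return 1 - sum(max(Fraction(0), x - y) for x, y in zip(a, b)) / count(a)


def s_ratio(a, b):
    """S(A, B) = M(A intersect B) / M(A), with intersection as pairwise minimum."""
    return sum(min(x, y) for x, y in zip(a, b)) / count(a)


rng = random.Random(0)  # fixed seed so the run is repeatable


def random_fit_vector(n):
    """Random fit vector on a grid of tenths; the first entry is positive."""
    head = [Fraction(rng.randint(1, 10), 10)]
    tail = [Fraction(rng.randint(0, 10), 10) for _ in range(n - 1)]
    return head + tail


pairs = [(random_fit_vector(5), random_fit_vector(5)) for _ in range(100)]
all_equal = all(s_fit_violation(a, b) == s_ratio(a, b) for a, b in pairs)
```

The agreement is exact because $a_i - \max(0, a_i - b_i) = \min(a_i, b_i)$ in every coordinate.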


The ratio form of the subsethood measure S(A, B) is familiar. It is the same as
the ratio form of the conditional probability $P(B \mid A)$. The fundamental difference is
that the ratio form is derived for the subsethood measure S(A, B) but assumed for
the conditional probability $P(B \mid A)$. This is the difference between showing and
telling. The inability to derive conditional probability further suggests that
probability is not real. For every probability is a conditional probability,
$P(A) = P(A \mid X)$.

Consider  first  the  physical  interpretation  of  randomness  as  a  relative  frequency. 
The  Subsethood  Theorem  suggests  that  randomness  is  a  working  fiction  akin  to 
the  luminiferous  ether  of  nineteenth-century  physics — the  phlogiston  of  thought. 
For  in  one  stroke  we  can  now  derive  the  relative  frequency  definition  of 
probability  as  S(X,  A),  the  degree  to  which  a  bivalent  superset  X,  the  sample  space, 
is  a  subset  of  its  own  subset  A.  The  concept  of  randomness  never  enters  the 
deterministic  framework. 

Suppose A and B are nonfuzzy subsets of X. (X, like every observed set, is at
most countably infinite.) Suppose A is a subset of B. In the extreme case B = X.
Then the degree of subsethood S(B, A) is what has traditionally been called a
relative frequency:

$$S(B, A) = \frac{M(A)}{M(B)} = \frac{n_A}{N},$$

where the N elements of B constitute the de facto universe of discourse of the
“experiment”. (Of course the limit of the ratio S(B, A) can be taken if it
mathematically makes sense [7].) The probability $n_A/N$ has been reduced to degrees
of subsethood, a purely fuzzy set-theoretical relationship. An immediate historical
speculation is that if set theory had been more carefully worked out first, the
notion of “randomness” might never have culturally evolved.

A classical example of relative frequency is the number $n_A$ of successful trials in
N trials. A biological example is the number of blue-eyed genes or alleles at all
such chromosomal loci in a gene pool. The new way of expressing these relative
frequencies S(B, A) is the degree to which all trials are successes or all genes at a
specific chromosomal location are for blue-eyedness. If the distinction between
successful and unsuccessful trials is not clear cut, the resulting fuzzy relative
frequency S(B, A) may be real-valued. The frequency structure remains since A is a
subset of B (since B = X invariably in practice).
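As a small illustration of relative frequency as subsethood, the Python sketch below computes S(X, A) for a run of trials. The trial outcomes and the graded success values are hypothetical, and the helper name `subsethood` is ours.

```python
from fractions import Fraction


def subsethood(a, b):
    """S(A, B) = M(A intersect B) / M(A) for fit vectors a and b."""
    overlap = sum(min(x, y) for x, y in zip(a, b))
    return Fraction(overlap) / Fraction(sum(a))


# Hypothetical record of N = 8 trials: 1 marks a success, 0 a failure.
trials = [1, 0, 1, 1, 0, 1, 0, 1]
X = [1] * len(trials)  # the whole space: every trial belongs fully

relative_frequency = subsethood(X, trials)  # n_A / N = 5/8

# If success is a matter of degree, the same ratio gives a fuzzy frequency.
graded = [Fraction(9, 10), Fraction(1, 10), 1, Fraction(1, 2),
          0, 1, 0, Fraction(1, 2)]
fuzzy_frequency = subsethood(X, graded)     # 4/8 = 1/2
```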

Where did the “randomness” go? The relative frequency S(B, A) describes a fuzzy
state of affairs, the degree to which B belongs to the power set of A:
$S(B, A) = m_{F(2^A)}(B)$. (Consider B = X and $A = \{x_2\}$ in the unit square: the frequency S(X, A)
corresponds by the Pythagorean Theorem to the ratio of the left cube edge and
the long diagonal to X.) Whether S(B, A) is a rational or irrational number seems
a technicality, a matter of fineness of quantization, if it is not zero or one. In
practice only physical objects like tossed coins and DNA strands are involved.
Their individual behavior might be fully determined by a system of differential
equations.

The key quantity is the measure of overlap $M(A \cap B)$. This count does not
involve  “randomness”.  It  counts  which  elements  are  identical  or  similar  and  to 
what  degree.  The  phenomena  themselves  are  deterministic.  The  corresponding 
frequency  number  that  summarizes  the  deterministic  situation  is  also  deterministic. 
The  same  situation  always  gives  the  same  number.  The  number  may  be  used  also 
to  place  bets  or  to  switch  a  phone  line,  but  it  remains  part  of  the  description  of  a 
specific  state  of  affairs.  The  deterministic  subsethood  derivation  of  relative 
frequency  eliminates  the  need  to  invoke  an  undefined  “randomness”  to  further 
describe  the  situation. 

The identification of relative frequency with probability is cultural, not logical.
This may take getting used to after hundreds of years of casting gambling
intuitions as matters of probability and a century of building probability into the
description of the universe. It is ironic that to date every assumption of
probability, at least in the relative frequency sense of science, engineering,
gambling, and daily life, has actually been an invocation of fuzziness.

9. BAYESIAN POLEMICS


Bayesian  probabilists  interpret  probability  as  a  subjective  state  of  knowledge.  In 
practice  they  use  relative  frequencies  (subsethood  degrees)  but  only  to  approximate 
these  “states  of  knowledge”. 

Bayesianism is a polemical doctrine. Bayesians claim that they, and only they,
use all and only the available uncertainty information in the description of
uncertain phenomena. This stems from the Bayes Theorem expansion of the “a
posteriori” conditional probability $P(H_i \mid E)$, the probability that $H_i$, the ith of
k-many disjoint hypotheses $\{H_j\}$, is true when evidence E is observed:

$$P(H_i \mid E) = \frac{P(E \cap H_i)}{P(E)} = \frac{P(E \mid H_i) P(H_i)}{P(E)} = \frac{P(E \mid H_i) P(H_i)}{\sum_{j=1}^{k} P(E \mid H_j) P(H_j)},$$

since the hypotheses partition the sample space X: $H_1 \cup H_2 \cup \cdots \cup H_k = X$ and
$H_i \cap H_j = \emptyset$ if $i \ne j$.

Conceptually, Bayesians use all available information in computing this posterior
distribution by using the “a priori” or prior distribution $P(H_i)$ of the
hypotheses. Mathematically, the Bayesian approach clearly stems from the ratio
form of the conditional probability.

The Subsethood Theorem trivially implies Bayes Theorem when the hypotheses
$\{H_i\}$ and evidence E are nonfuzzy subsets. More important, the Subsethood
Theorem implies the Fuzzy Bayes Theorem in the more interesting case when the
observed data E is fuzzy:

$$S(E, H_i) = \frac{S(H_i, E) M(H_i)}{\sum_{j=1}^{k} S(H_j, E) M(H_j)} = \frac{S(H_i, E) f_i}{\sum_{j=1}^{k} S(H_j, E) f_j},$$

where

$$f_i = \frac{M(H_i)}{M(X)} = \frac{M(H_i)}{n} = S(X, H_i)$$

is the “relative frequency” of $H_i$, the degree to which all the hypotheses are $H_i$. So
the Subsethood Theorem allows fuzzyists to be “Bayesians” as well.
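The Fuzzy Bayes Theorem can be illustrated with a small numerical case. The Python sketch below assumes three nonfuzzy hypotheses that partition a six-element space and a fuzzy evidence set E; all fit values are hypothetical and the helper names are ours.

```python
from fractions import Fraction as Fr


def count(a):
    """Count M(A): the sum of the fit values."""
    return sum(a)


def S(a, b):
    """Subsethood S(A, B) = M(A intersect B) / M(A)."""
    return Fr(sum(min(x, y) for x, y in zip(a, b))) / Fr(count(a))


# Hypothetical fuzzy evidence over a six-element space.
E = [Fr(1, 5), Fr(4, 5), Fr(1, 2), Fr(0), Fr(9, 10), Fr(3, 10)]

# Three nonfuzzy hypotheses that partition the space.
H = [[1, 1, 0, 0, 0, 0],
     [0, 0, 1, 1, 1, 0],
     [0, 0, 0, 0, 0, 1]]

n = len(E)
f = [Fr(count(h), n) for h in H]  # relative frequencies f_i = S(X, H_i)

# Right-hand side of the Fuzzy Bayes Theorem.
evidence = sum(S(h, E) * fi for h, fi in zip(H, f))
posterior = [S(h, E) * fi / evidence for h, fi in zip(H, f)]

# Left-hand side computed directly from the Subsethood Theorem.
direct = [S(E, h) for h in H]
```

Because the hypotheses are nonfuzzy and disjoint, the posterior subsethoods sum to one, as posterior probabilities do.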

The Subsethood Theorem implies inequality when the partitioning hypotheses
are fuzzy. For instance, if k = 2, $H^c$ is the complement of an arbitrary fuzzy set H,
and evidence E is fuzzy, then [10] the occurrence of nondegenerate hypothesis
overlap and underlap gives a lower bound on the posterior subsethood:

$$S(E, H) \ge \frac{S(H, E) f_H}{S(H, E) f_H + S(H^c, E) f_{H^c}},$$

where $f_H = S(X, H)$. The lower bound is an increasing function of M(H), a
decreasing function of $M(H^c)$. Since a like lower bound holds for $S(E, H^c)$, adding
the two posterior subsethoods gives the additive inequality:

$$S(E, H) + S(E, H^c) \ge 1,$$

an inequality arrived at independently by Zadeh [17] by directly defining a “relative
sigma-count” as the subsethood measure given by the Subsethood Theorem. If H
is nonfuzzy, equality holds as in the additive law of conditional probability:

$$P(H \mid E) + P(H^c \mid E) = 1.$$
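The additive inequality is easy to see on a small case. The Python sketch below uses hypothetical fit vectors and our own helper names; it evaluates $S(E, H) + S(E, H^c)$ once for a fuzzy hypothesis and once for a nonfuzzy one.

```python
from fractions import Fraction as Fr


def S(a, b):
    """Subsethood S(A, B) = M(A intersect B) / M(A)."""
    return sum(min(x, y) for x, y in zip(a, b)) / sum(a)


def complement(a):
    """Fuzzy complement: fit values 1 - a_i."""
    return [1 - x for x in a]


# Hypothetical fuzzy evidence and hypotheses.
E = [Fr(3, 5), Fr(1, 5), Fr(9, 10)]
H_fuzzy = [Fr(1, 2), Fr(7, 10), Fr(2, 5)]
H_crisp = [1, 0, 1]

fuzzy_total = S(E, H_fuzzy) + S(E, complement(H_fuzzy))  # 24/17, exceeds 1
crisp_total = S(E, H_crisp) + S(E, complement(H_crisp))  # exactly 1
```

The excess over one in the fuzzy case comes from the overlap of H with its complement, which a nonfuzzy hypothesis lacks.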

The Subsethood Theorem implies a deeper Bayes Theorem for arbitrary fuzzy
sets, the Odds-Form Fuzzy Bayes Theorem:

$$\frac{S(A_1 \cap H, A_2)}{S(A_1 \cap H, A_2^c)} = \frac{S(A_2 \cap H, A_1)}{S(A_2^c \cap H, A_1)} \cdot \frac{S(H, A_2)}{S(H, A_2^c)}.$$

This theorem is proved directly by replacing the subsethood terms on the
righthand side with their equivalent ratios of counts, canceling like terms three
times, multiplying by $M(A_1 \cap H)/M(A_1 \cap H)$, rearranging, and applying the
Subsethood Theorem a second time.
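The odds-form identity can be spot-checked with exact arithmetic. In the Python sketch below the fit vectors are hypothetical and chosen so that no denominator vanishes; the helper names `meet`, `complement`, `M` and `S` are ours.

```python
from fractions import Fraction as Fr


def meet(*sets):
    """Fuzzy intersection: pairwise minimum across all given fit vectors."""
    return [min(values) for values in zip(*sets)]


def complement(a):
    """Fuzzy complement: fit values 1 - a_i."""
    return [1 - x for x in a]


def M(a):
    """Count: the sum of the fit values."""
    return sum(a)


def S(a, b):
    """Subsethood S(A, B) = M(A intersect B) / M(A)."""
    return M(meet(a, b)) / M(a)


# Hypothetical fuzzy sets on a four-element space.
H = [Fr(4, 5), Fr(1, 2), Fr(3, 10), Fr(1)]
A1 = [Fr(3, 5), Fr(9, 10), Fr(1, 5), Fr(1, 2)]
A2 = [Fr(1, 10), Fr(7, 10), Fr(4, 5), Fr(3, 5)]
A2c = complement(A2)

lhs = S(meet(A1, H), A2) / S(meet(A1, H), A2c)
rhs = (S(meet(A2, H), A1) / S(meet(A2c, H), A1)) * (S(H, A2) / S(H, A2c))
```

Both sides reduce to the ratio of counts $M(A_1 \cap A_2 \cap H) / M(A_1 \cap A_2^c \cap H)$, which is the cancellation the proof describes.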

We have now developed enough fuzzy theory to examine critically the recent
anti-fuzzy polemics of Lindley [13] and Jaynes [6] (and thus Cheeseman [2] who uses
Jaynes’ arguments). To begin we observe four more corollaries of the Subsethood
Theorem:

i) $0 \le S(H, A) \le 1$,

ii) $S(H, A) = 1$ if $H \subset A$,

iii) $S(H, A_1 \cup A_2) = S(H, A_1) + S(H, A_2) - S(H, A_1 \cap A_2)$,

iv) $S(H, A_1 \cap A_2) = S(H, A_1) S(A_1 \cap H, A_2)$.

Each relationship follows from the ratio form of S(A, B). The third relationship
uses the additivity of the count M(A), which follows from
$\min(x, y) + \max(x, y) = x + y$.
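The four corollaries can be spot-checked on a small case. The Python sketch below uses hypothetical fuzzy sets and our own helper names, with union and intersection taken as pairwise maximum and minimum.

```python
from fractions import Fraction as Fr


def meet(a, b):
    """Fuzzy intersection: pairwise minimum."""
    return [min(x, y) for x, y in zip(a, b)]


def join(a, b):
    """Fuzzy union: pairwise maximum."""
    return [max(x, y) for x, y in zip(a, b)]


def S(a, b):
    """Subsethood S(A, B) = M(A intersect B) / M(A)."""
    return sum(meet(a, b)) / sum(a)


# Hypothetical fuzzy sets on a three-element space.
H = [Fr(1, 2), Fr(4, 5), Fr(3, 10)]
A1 = [Fr(7, 10), Fr(1, 5), Fr(9, 10)]
A2 = [Fr(2, 5), Fr(3, 5), Fr(1, 10)]

bounded = 0 <= S(H, A1) <= 1                                      # corollary i
superset_gives_one = S(H, join(H, A1)) == 1                       # corollary ii
union_lhs = S(H, join(A1, A2))                                    # corollary iii
union_rhs = S(H, A1) + S(H, A2) - S(H, meet(A1, A2))
product_lhs = S(H, meet(A1, A2))                                  # corollary iv
product_rhs = S(H, A1) * S(meet(A1, H), A2)
```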

Now make the notational identification $S(H, A) = P(A \mid H)$. We then obtain the
defining relationships of conditional probability proposed by Lindley [13]:

Convexity: $0 \le P(A \mid H) \le 1$ and $P(A \mid H) = 1$ if H implies A,

Addition: $P(A_1 \cup A_2 \mid H) = P(A_1 \mid H) + P(A_2 \mid H) - P(A_1 \cap A_2 \mid H)$,

Multiplication: $P(A_1 \cap A_2 \mid H) = P(A_1 \mid H) P(A_2 \mid A_1 \cap H)$.

“From these three rules”, Lindley [13] tells us,

all of the many, rich and wonderful results of the probability calculus follow. They may be described as
the axioms of probability.

Lindley takes these as “unassailable” axioms:

We really have no choice about the rules governing our measurement of uncertainty: they are dictated
to us by the inexorable laws of logic.

Lindley proceeds to build a “coherence” argument around the Odds-Form Bayes
Theorem, which he correctly deduces from the axioms as the equality:

$$\frac{P(A_2 \mid A_1 \cap H)}{P(A_2^c \mid A_1 \cap H)} = \frac{P(A_1 \mid A_2 \cap H)}{P(A_1 \mid A_2^c \cap H)} \cdot \frac{P(A_2 \mid H)}{P(A_2^c \mid H)},$$

where here we interpret $A^c$ as not-A. “Any other procedure”, we are told, “is
incoherent.” This polemic evaporates in the face of the above four subsethood
corollaries and the Odds-Form Fuzzy Bayes Theorem. Ironically, rather than
establish the primacy of axiomatic probability, Lindley seems to argue that it is
fuzziness in disguise.

Another source of Bayesian probability polemic [2] is maximum entropy estimation.
Here the axiomatic argument rests on the so-called Cox’s Theorem [3]. Cox’s
Theorem is best presented by its most vocal proponent, physicist E. T. Jaynes.
According to Jaynes [6]:

Cox proved that any method of inference in which we represent degrees of plausibility by real numbers,
is necessarily either equivalent to Laplace’s, or inconsistent,

where Laplace is cited as an early Bayesian probabilist. In fact Cox used bivalent
logic (Boolean algebra) and other assumptions to show that, again according to
Jaynes, the “conditions of consistency can be stated in the form of functional
equations,” namely the probabilistic product and sum rules:

$$P(A \cap B \mid C) = P(A \mid B \cap C) P(B \mid C),$$

$$P(B \mid A) + P(B^c \mid A) = 1.$$


The Subsethood Theorem implies

$$S(C, A \cap B) = S(B \cap C, A) S(C, B),$$

$$S(A, B) + S(A, B^c) \ge 1,$$

with, as we have seen, equality holding for the second subsethood relationship
when B is nonfuzzy, which is the case in the Cox-Jaynes setting.

In the probabilistic case overlap and underlap are degenerate. So

$$P(A \cap A^c \mid B) = P(\emptyset \mid B) = 0,$$

and $P(B \mid A \cap A^c) = P(B \mid \emptyset)$ is undefined. Yet in general $S(B, A \cap A^c) > 0$ and
$S(A \cap A^c, B)$ is defined when A is fuzzy and B is fuzzy or nonfuzzy.

Jaynes’ claim is either false or concedes that probability is a special case of
fuzziness. For strictly speaking, since the subsethood measure S(A, B) satisfies the
multiplicative and additive laws specified by Cox and yet differs from the
conditional probability $P(B \mid A)$, Jaynes’ claim is false.

Presumably Jaynes was unaware of fuzzy sets. He seems to suggest that the only
alternative uncertainty theory is the frequency theory of probability, a theory we
have seen reduced to the subsethood measure S(X, A). So if we restrict consideration
to nonfuzzy sets A and B, equality holds in the above subsethood relations
and Jaynes is right: probability and fuzziness coincide. But fuzziness exists, indeed
abounds, outside this restriction and classical probability theory does not. So fuzzy
theory is an extension of probability theory. Equivalently, probability then is a
special case of fuzziness.

Incidentally, when one examines Cox’s actual arguments [3], one finds that Cox
assumes that the uncertainty combination operators in question are continuously
twice differentiable! Min and max are not twice differentiable. Technically, Cox’s
theorem does not apply.

10. THE ENTROPY-SUBSETHOOD THEOREM

The  Fuzzy  Entropy  Theorem  and  the  Subsethood  Theorem  were  independently 
derived  from  first  principles,  from  sets-as-points  unit-cube  geometry.  Both 
theorems  involve  ratios  of  cardinalities.  A  connection  is  inevitable. 

The Entropy-Subsethood Theorem shows that the connection occurs in terms of
overlap $A \cap A^c$ and underlap $A \cup A^c$ (what else?). The theorem says fuzzy entropy
can be eliminated in favor of subsethood. So subsethood emerges as the
fundamental, characterizing quantity of fuzziness and, arguably, of probability as
well.

Entropy-Subsethood Theorem

$$E(A) = S(A \cup A^c, A \cap A^c).$$

The theorem is proved by replacing B and A in the Subsethood Theorem
respectively with overlap $A \cap A^c$ and underlap $A \cup A^c$. Since overlap is a
(dominated-membership function) subset of underlap, the intersection of the two
sets is just overlap.
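The theorem can be illustrated numerically. The Python sketch below assumes the ratio-of-counts form of fuzzy entropy, $E(A) = M(A \cap A^c)/M(A \cup A^c)$, which we take from the Fuzzy Entropy Theorem referred to above; the example fit vector and helper names are ours.

```python
from fractions import Fraction as Fr


def complement(a):
    """Fuzzy complement: fit values 1 - a_i."""
    return [1 - x for x in a]


def meet(a, b):
    """Fuzzy intersection: pairwise minimum."""
    return [min(x, y) for x, y in zip(a, b)]


def join(a, b):
    """Fuzzy union: pairwise maximum."""
    return [max(x, y) for x, y in zip(a, b)]


def S(a, b):
    """Subsethood S(A, B) = M(A intersect B) / M(A)."""
    return sum(meet(a, b)) / sum(a)


def entropy_ratio(a):
    """Assumed ratio form of fuzzy entropy: M(overlap) / M(underlap)."""
    ac = complement(a)
    return sum(meet(a, ac)) / sum(join(a, ac))


def entropy_subsethood(a):
    """Degree to which underlap is a subset of overlap."""
    ac = complement(a)
    return S(join(a, ac), meet(a, ac))


A = [Fr(1, 3), Fr(3, 4)]         # hypothetical fuzzy set in the unit square
midpoint = [Fr(1, 2), Fr(1, 2)]  # maximally fuzzy point of the cube
vertex = [Fr(1), Fr(0)]          # nonfuzzy set (a cube vertex)

e_a = entropy_subsethood(A)      # 7/17
```

The midpoint gives total containment of underlap in overlap and a vertex gives none, matching the extremes the text describes.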

The Entropy-Subsethood Theorem is a peculiar relationship. It says that
fuzziness is the degree to which the superset $A \cup A^c$ is a subset of its own subset
$A \cap A^c$, the extent to which the whole is a part of one of its own parts, a
relationship forbidden by Western logic.

This relationship violates our ingrained Venn-diagram intuitions of unambiguous
set inclusion. Only the midpoint of I^n yields total containment of underlap
in overlap. The cube vertices yield no containment. This parallels in the extreme
the relative frequency relationship S(X, A) = n_A/N, where a nonfuzzy superset X is
to some degree a subset of one of its nonfuzzy subsets A.

Figure 11 Entropy-Subsethood Theorem in two dimensions. The long diagonals have equal
length: d(A, A^c) = d(A ∪ A^c, A ∩ A^c). This common length equals
d* = M(A ∪ A^c) − M(A ∩ A^c), the shortest distance from A ∪ A^c to the
fuzzy power set of A ∩ A^c.


Figure 11 illustrates the Entropy-Subsethood Theorem. It shows that d*, the
shortest distance from underlap A ∪ A^c to the hyper-rectangle defining the fuzzy
power set of overlap A ∩ A^c, is equivalent to d(A ∪ A^c, A ∩ A^c) = d(A, A^c) and to a
difference of vector magnitudes: d* = M(A ∪ A^c) − M(A ∩ A^c).
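
A quick numerical check of these equalities is sketched below. It assumes that d is the l1 (fuzzy Hamming) distance and that M sums the fit values; neither is restated in this section, and the sample point is hypothetical.

```python
def l1_distance(a, b):
    """Assumed metric: the l1 (fuzzy Hamming) distance between fit vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))

A = [0.25, 0.75]                              # hypothetical point in the unit square
Ac = [1.0 - x for x in A]
overlap = [min(x, y) for x, y in zip(A, Ac)]
underlap = [max(x, y) for x, y in zip(A, Ac)]

d_diagonal = l1_distance(A, Ac)               # d(A, A^c)
d_other = l1_distance(underlap, overlap)      # d(A ∪ A^c, A ∩ A^c)
d_star = sum(underlap) - sum(overlap)         # M(A ∪ A^c) − M(A ∩ A^c)
```

All three quantities agree for this point, as the figure suggests they should for any point of the cube.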

The Entropy-Subsethood Theorem implies that no probability measure
measures fuzziness. For the moment, suppose not. Suppose fuzzy entropy measures
nothing new; fuzziness is simply disguised probability. Suppose, as Lindley13
claims, that probability theory “is adequate for all problems involving
uncertainty.” So there exists some probability measure P such that P = E. P cannot be
identically zero because P(X) = 1. Then there is some A such that P(A) = E(A) > 0.
But in a probability space there is no overlap or underlap: A ∩ A^c = ∅ and
A ∪ A^c = X.

The Entropy-Subsethood Theorem then implies that 0 < P(A) = E(A) =
S(A ∪ A^c, A ∩ A^c) = S(X, ∅). The only way X can be a subset to any degree of the
empty set is if X itself, and hence A, is empty: X = A = ∅. Then the sure event X is
impossible: P(X) = P(∅) = 0. Or the impossible event is sure: P(∅) = 1. Either
outcome is a bivalent contradiction, impervious to normalization. So there exists
no probability measure P that measures fuzziness. Fuzziness exists.

This within-cube theory can be extended11 to define a natural fuzzy integral with
respect to the fuzzy counting measure M. A more practical extension11 is to
mappings between fuzzy cubes, in particular to fuzzy associative memories. In
short, a fuzzy set is a point in a unit hypercube I^n. A fuzzy system S: I^n → I^p is a
mapping between cubes. Fuzzy systems map fuzzy subsets of the input space X to
fuzzy subsets of the output space Y. Fuzzy systems are tools of machine
intelligence, and can be applied to a wide range of control and decision problems.

11. PRECISE PAST, FUZZY FUTURE4

The  boat  of  uncertainty  reasoning  is  being  rebuilt  at  sea.  Plank  by  plank  fuzzy 
theory  is  beginning  to  gradually  shape  its  design.  Today  only  a  few  fuzzy  planks 
have  been  laid.  But  a  hundred  years  from  now,  a  thousand  years  from  now,  the 
boat  of  uncertainty  reasoning  may  little  resemble  the  boat  of  today.  Notions  and 
measures of overlap A ∩ A^c and underlap A ∪ A^c will have smoothed its rudder.
Amassed  fuzzy  applications,  hardware,  and  products  will  have  broadened  its  sails. 
And  no  one  on  the  boat  will  believe  that  there  was  a  time  when  a  concept  as 
simple,  as  intuitive,  as  expressive18  as  a  fuzzy  set  met  with  such  impassioned 
denial. 

How  would  the  world  be  different  today  if  fuzziness  had  been  developed,  taught, 
and  applied  before  probability  theory?  Suppose  the  fuzzy  framework  was  worked 
out at the time of Galileo or Laplace. Suppose Isaac Newton included an appendix
on  the  geometry  of  fuzzy  sets  in  his  Principia.  What  would  be  different  today? 

Reasoning  systems  in  machine  intelligence  would  surely  be  different.  So  would 
be  the  range  of  automatic  control  devices.  There  would  be  many  more  of  them, 
and  they  would  more  accurately  reflect  our  reasoning  processes  than  do  our 
current  decision  trees  and  thermostats.  Western  belief  systems  might  be  more 
Eastern,  and  vice  versa.  (How  many  Westerners  can  name  five  Eastern  books?) 
More  of  social  science  might  be  systematized.  Historical  tendencies  would  have 
been  easier  to  articulate  and  defend.  Communication,  signal  processing,  and 
computational  hardware  might  be  built  around  the  fit.  Our  physical  explorations 
of  subatomic  reality,  antimatter,  and  the  spacetime  fabric  may  have  led  to  different 
times and places. Relative frequencies might be considered the everyday application
of fuzzy subsethood. Besides betting on games of chance or frequency,
betting  on  games  of  degree — perhaps  involving  simulated  chaotic  trajectories  in 
unit  cubes  (or  guppies  swimming  in  hand-held  cubical  aquaria)  or  real-valued 
dice — might  help  support  the  economy  of  Las  Vegas. 

As  the  total  amount  of  information  in  society  continues  to  grow  exponentially, 
the  velocity  of  scientific  and  cultural  change  increases.  Cultural  change  that  once 
took  centuries  can  now  occur  in  a  few  years,  perhaps  soon  in  a  single  year.  A 
current  engineering  example  of  this  velocity  of  change  is  Moore’s  Law,  the 
doubling  of  silicon-chip  transistor  density  every  one  to  two  years. 

One  tendency  of  this  information  acceleration  is  to  leave  further  behind  what 
has  already  been  explored.  The  complementary  tendency  is  to  soon  experiment 
with  systems  that  may  at  present  seem  distant,  impractical,  even  absurd.  In  this 
light  the  recent  developments  in  fuzzy  theory  and  in  fuzzy  applications  and 
hardware  will  surely  affect  the  science,  engineering,  and  culture  of  the  future.  The 
question is to what degree.

Abstract

We developed fuzzy and neural-network control systems to back up a simulated truck,
and truck-and-trailer, to a loading dock in a planar parking lot. The fuzzy systems
performed well until we randomly removed over 50 % of their fuzzy-associative-memory (FAM)
rules. They also performed well when we replaced key FAM equilibration rules with
destructive or “sabotage” rules. We trained the neural network systems with the supervised
backpropagation learning algorithm and tested their robustness by removing random
subsets of training data in learning sequences. The neural systems performed well but required
extensive computation for training. We used unsupervised differential competitive
learning (DCL), and product-space clustering, to adaptively generate FAM rules from training
data. The original fuzzy and neural control systems generated trajectory data. The DCL
system rapidly recovered the underlying FAM rules. Product-space clustering converted
the neural truck systems into structured sets of FAM rules that approximated the neural
system’s behavior.

Fuzzy and Neural Control Systems


We  construct  fuzzy  and  neural  control  systems  directly  from  control  data,  but  from 
different  types  of  control  data.  Fuzzy  systems  use  a  small  number  of  structured  linguistic 
input-output  samples  from  an  expert  or  from  some  other  adaptive  estimator.  Neural 
systems  use  a  large  number  of  numeric  input-output  samples  from  the  control  process  or 
from  some  other  database.  Adaptive  fuzzy  systems  also  use  numeric  control  data. 

Figure 1 illustrates this difference. The neural system estimates function f: X → Y
from several numerical point samples (x_i, y_i). The fuzzy system estimates f from a few
fuzzy set samples or fuzzy associations (A_i, B_i).


FIGURE  1  Geometry  of  neural  and  fuzzy  function  estimation.  The  neural 
approach  (a)  uses  several  numerical  point  samples.  The  fuzzy  approach  (b) 
uses  a  few  fuzzy  set  samples. 

Fuzzy  and  neural  systems  offer  a  key  advantage  over  traditional  control  approaches. 
They  offer  model-free  estimation  of  the  control  system.  The  user  need  not  specify  how 
the  controller’s  output  mathematically  depends  on  its  input.  Instead  the  user  provides  a 
few  common-sense  associations  of  how  the  control  variables  behave.  Or  the  user  provides 
a  statistically  representative  set  of  numerical  training  samples.  Even  if  a  math-model 
controller  is  available,  fuzzy  or  neural  controllers  may  prove  more  robust  and  easier  to 
modify. 

Which system, fuzzy or neural, performs better for which type of control problem
depends on the type and availability of sample data. If experts provide structured knowledge
of the control process, or if sufficient numerical training samples are unavailable, the fuzzy
approach may be preferable. We can construct a fuzzy control system with comparative
ease when experts or fuzzy engineers provide accurate structured knowledge. A fuzzy
control system seems a reasonable benchmark in such cases, even if we can develop a neural
controller or math-model controller.

If we have representative numerical data but not structured expertise, the neural
approach may be preferable. Or a statistical regression approach may be more appropriate.
The data simply tell their own story, if there is a story to tell. Yet even here we can
use a hybrid fuzzy-neural system, an adaptive fuzzy system. We can use the numerical
data to generate fuzzy associative memory (FAM) rules. The FAM rules can then form the
skeleton of a fuzzy control architecture. In short, if structured knowledge is unavailable,
estimate it. This may be more practical than it would appear because of the small number
of control FAM rules needed to reliably control many real-world processes.

How  can  we  compare  fuzzy  and  neural  controllers?  Abstract  comparison  proves  difficult 
because  both  approaches  build  a  control  black  box  in  different  ways.  That  they  build  black 
boxes  distinguishes  them  from  math-model  controllers.  It  also  suggests  we  can  compare 
them,  at  least  approximately,  by  their  black-box  control  performance. 

Each  control  system  generated  an  output  control  surface  as  it  ranged  over  the  common 
input  space  of  parameter  values.  Figure  5  below  shows  three-dimensional  control  surfaces 
for  the  fuzzy  and  neural  controllers.  For  control  systems  with  few  input  parameters  with 
moderately  quantized  ranges,  we  can  store  both  fuzzy  and  neural  controllers — or  rather 
their  quantized  control  surfaces — as  decision  look-up  tables.  Then  once  we  specify  a  system 
performance  criterion,  we  can  in  principle  quantitatively  compare  the  controllers. 

Comparing system trajectories proved more complicated. In the case at hand, we
wanted to back up a truck, and truck-and-trailer, to a loading dock. We can measure and
compare the quality and quantity of the truck trajectory, perhaps with mean-squared
error criteria. Intuitively, we preferred smooth short trajectories to jagged long trajectories.
Reaching the loading-dock goal was also important. In practice it is the most
important performance requirement. We must balance the trajectory type with the trajectory
destination, and this reduces to the pragmatic issue of balancing means and ends.

Below  we  develop  a  simple  fuzzy  control  system  and  a  simple  neural  control  system 
for  backing  up  a  truck,  and  truck-and-trailer,  in  an  open  parking  lot.  The  recent  neural 
network  truck  backer-upper  simulation  of  Nguyen  and  Widrow  [1989]  motivated  our  choice 
of  control  problem. 

The fuzzy control system compared favorably with the neural controller in terms of
black-box development effort, black-box computational load, smoothness of truck
trajectories, and robustness.

We studied robustness of the fuzzy control systems in two ways. We deliberately added
confusing FAM rules (“sabotage” rules) to the system, and we randomly removed
different subsets of FAM rules. We studied robustness of the neural controller by randomly
removing different portions of the training data in learning sequences. We also converted
the neural control systems to structured FAM-bank systems.


Backing  up  a  truck 

Figure 2 shows the simulated truck and loading zone. The truck corresponds to the cab
part of the neural truck in the Nguyen-Widrow neural truck backer-upper system. The
three state variables φ, x, and y exactly determine the truck position. φ specifies the angle
of the truck with the horizontal. The coordinate pair (x, y) specifies the position of the
rear center of the truck in the plane.

The goal was to make the truck arrive at the loading dock at a right angle (φ_f = 90°)
and to align the position (x, y) of the truck with the desired loading dock (x_f, y_f). We
considered only backing up. The truck moved backward by some fixed distance at every
stage. The loading zone corresponded to the plane [0,100] × [0,100], and (x_f, y_f) equaled
(50,100).

At every stage the fuzzy and neural controllers should produce the steering angle θ that
backs up the truck to the loading dock from any initial position and from any angle in the
loading zone.

FIGURE 2 Diagram of simulated truck and loading zone.


Fuzzy  Truck  Backer-Upper  System 

We first specified each controller’s input and output variables. The input variables were
the truck angle φ and the x-position coordinate x. The output variable was the
steering-angle signal θ. We assumed enough clearance between the truck and the loading dock so
we could ignore the y-position coordinate. The variable ranges were as follows:

0 ≤ x ≤ 100 ,

-90 ≤ φ ≤ 270 ,

-30 ≤ θ ≤ 30 .

Positive values of θ represented clockwise rotations of the steering wheel. Negative values
represented counterclockwise rotations. We discretized all values to reduce computation.
The resolution of φ and θ was one degree each. The resolution of x was 0.1.

Next  we  specified  the  fuzzy-set  values  of  the  input  and  output  fuzzy  variables.  The 
fuzzy  sets  numerically  represented  linguistic  terms,  the  sort  of  linguistic  terms  an  expert 
might  use  to  describe  the  control  system’s  behavior.  We  chose  the  fuzzy-set  values  of  the 
fuzzy variables as follows:

Angle φ               x-position x         Steering-angle signal θ
RB: Right Below       LE: Left             NB: Negative Big
RU: Right Upper       LC: Left Center      NM: Negative Medium
RV: Right Vertical    CE: Center           NS: Negative Small
VE: Vertical          RC: Right Center     ZE: Zero
LV: Left Vertical     RI: Right            PS: Positive Small
LU: Left Upper                             PM: Positive Medium
LB: Left Below                             PB: Positive Big

Fuzzy subsets contain elements with degrees of membership. A fuzzy membership
function m_A: Z → [0, 1] assigns a real number between 0 and 1 to every element z in
the universe of discourse Z. This number m_A(z) indicates the degree to which the object
or data z belongs to the fuzzy set A. Equivalently, m_A(z) defines the fit (fuzzy unit) value
[Kosko, 1986] of element z in A.

Fuzzy membership functions can have different shapes depending on the designer’s
preference or experience. In practice fuzzy engineers have found triangular and trapezoidal
shapes help capture the modeler’s sense of fuzzy numbers and simplify computation.
Figure 3 shows membership-function graphs of the fuzzy subsets above. In the third graph,
for example, θ = 20° is Positive Medium to degree 0.5, but only Positive Big to degree 0.3.
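
A triangular membership function takes only a few lines to evaluate. The sketch below is illustrative: the helper name `triangle` and the breakpoints 5, 15, and 25 degrees are hypothetical and are not read off Figure 3.

```python
def triangle(z, left, peak, right):
    """Triangular membership: 0 outside (left, right), rising to 1 at peak."""
    if z <= left or z >= right:
        return 0.0
    if z <= peak:
        return (z - left) / (peak - left)
    return (right - z) / (right - peak)

def positive_medium(theta):
    # Hypothetical breakpoints in degrees, chosen only for illustration.
    return triangle(theta, 5.0, 15.0, 25.0)

fit = positive_medium(20.0)   # fit value of a 20-degree steering signal
```

With these assumed breakpoints a 20° signal belongs to the set to degree 0.5; a trapezoid would add a flat top between two peak values.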

In Figure 3 the fuzzy sets CE, VE, and ZE are narrower than the other fuzzy sets.
These narrow fuzzy sets permit fine control near the loading dock. We used wider fuzzy
sets to describe the endpoints of the range of the fuzzy variables φ, x, and θ. The wider
fuzzy sets permitted rough control far from the loading dock.

Next we specified the fuzzy “rulebase” or bank of fuzzy associative memory (FAM) rules.
Fuzzy associations or “rules” (A, B) associate output fuzzy sets B of control values with
input fuzzy sets A of input-variable values. We can write fuzzy associations as
antecedent-consequent pairs or IF-THEN statements.

FIGURE 3 Fuzzy membership functions for each linguistic fuzzy-set value.
To allow finer control, the fuzzy sets that correspond to near the loading dock
are narrower than the fuzzy sets that correspond to far from the loading dock.

In the truck backer-upper case, the FAM bank contained the 35 FAM rules in Figure 4.
For example, the FAM rule of the left upper block (FAM rule 1) corresponds to the following
fuzzy association:

IF x = LE AND φ = RB, THEN θ = PS.

FAM rule 18 indicates that if the truck is near the equilibrium position, then the
controller should not produce a positive or negative steering-angle signal. The FAM rules
in the FAM-bank matrix reflect the symmetry of the controlled system.

For the initial condition x = 50 and φ = 270, the fuzzy truck did not perform well.
The symmetry of the FAM rules and the fuzzy sets cancelled the fuzzy controller output in
a rare saddle point. For this initial condition, the neural controller (and truck-and-trailer
below) also performed poorly. Any perturbation breaks the symmetry. For example, the
rule (IF x = 50 AND φ = 270, THEN θ = 5) corrected the problem.

The three-dimensional control surfaces in Figure 5 show steering-angle signal outputs
θ that correspond to all combinations of values of the two input state variables φ and
x. The control surface defines the fuzzy controller. In this simulation the
correlation-minimum FAM inference procedure, discussed in [Kosko, 1990a], determined the fuzzy
control surface. If the control surface changes with sampled variable values, the system
behaves as an adaptive fuzzy controller. Below we demonstrate unsupervised adaptive
control of the truck and the truck-and-trailer systems.

FIGURE 4 FAM-bank matrix for the fuzzy truck backer-upper controller.

FIGURE 5 (a) Control surface of the fuzzy controller. Fuzzy-set values
determined the input and output combination corresponding to FAM rule 2
(IF x = LC AND φ = RB, THEN θ = PM). (b) Corresponding control surface of
the neural controller for constant value y = 20.

Finally, we determined the output action given the input conditions. We used the
correlation-minimum inference method illustrated in Figure 6. Each FAM rule produced
the output fuzzy set clipped at the degree of membership determined by the input
conditions and the FAM rule. Alternatively, correlation-product inference [Kosko, 1990a] would
combine FAM rules multiplicatively. Each FAM rule emitted a fit-weighted output fuzzy
set O_i at each iteration. The total output O added these weighted outputs:

$$O = \sum_i O_i \tag{1}$$

$$\phantom{O} = \sum_i \min(f_i, S_i) , \tag{2}$$

where f_i denotes the antecedent fit value and S_i represents the consequent fuzzy set of
steering-angle values in the ith FAM rule. Earlier fuzzy systems combined the output
sets O_i with pairwise maxima. But this tends to produce a uniform output set O as the
number of FAM rules increases. Adding the output sets O_i invokes the fuzzy version of
the Central Limit Theorem. This tends to produce a symmetric, unimodal output fuzzy
set O of steering-angle values.
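
Equations (1) and (2) can be sketched in a few lines. The example below is a minimal illustration, not the simulation code: the two consequent sets are sampled at five steering angles, and both the sets and the input fit values are hypothetical.

```python
def correlation_minimum(fit, consequent):
    """Clip the consequent fuzzy set S_i at the antecedent fit value f_i."""
    return [min(fit, s) for s in consequent]

def combine_additively(outputs):
    """O = sum_i O_i, added pointwise over the sampled steering angles."""
    return [sum(values) for values in zip(*outputs)]

# Hypothetical consequent sets sampled at five steering angles.
S_zero = [0.0, 0.5, 1.0, 0.5, 0.0]
S_positive_small = [0.0, 0.0, 0.5, 1.0, 0.5]

# Antecedents joined with AND: each rule fires at the minimum of its input fits.
f1 = min(0.75, 0.5)    # hypothetical fits of x and phi for the first rule
f2 = min(0.25, 1.0)    # hypothetical fits of x and phi for the second rule

O = combine_additively([
    correlation_minimum(f1, S_zero),
    correlation_minimum(f2, S_positive_small),
])
```

Replacing `sum` with `max` in `combine_additively` would give the older pairwise-maximum combination that the text contrasts with addition.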

Fuzzy systems map fuzzy sets to fuzzy sets. The fuzzy control system’s output defines
the fuzzy set O of steering-angle values at each iteration. We must “defuzzify” the fuzzy
set O to produce a numerical (point-estimate) steering-angle output value θ.

As  discussed  in  [Kosko,  1990a],  the  simplest  defuzzification  scheme  selects  the  value 
corresponding  to  the  maximum  fit  value  in  the  fuzzy  set.  This  mode-selection  approach 
ignores  most  of  the  information  in  the  output  fuzzy  set  and  requires  an  additional  decision 
algorithm  when  multiple  modes  occur. 

FIGURE 6 Correlation-minimum inference with centroid defuzzification
method. FAM-rule antecedents combined with AND use the minimum
fit value to activate consequents. Those combined with OR would use the
maximum fit value.

Centroid defuzzification provides a more effective procedure. This method uses the
fuzzy centroid θ̄ as output:

$$\bar{\theta} = \frac{\sum_{j=1}^{p} \theta_j \, m_O(\theta_j)}{\sum_{j=1}^{p} m_O(\theta_j)} , \tag{3}$$

where O defines a fuzzy subset of the steering-angle universe of discourse Θ = {θ_1, ..., θ_p}.
The central-limit-theorem effect produced by adding output fuzzy sets O_i benefits both
max-mode and centroid defuzzification. Figure 6 shows the correlation-minimum inference and
centroid defuzzification applied to FAM rules 13 and 18. We used centroid defuzzification
in all simulations.
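
Equation (3) and the max-mode alternative can be compared on a small sampled output set. This is a sketch under assumptions: the steering-angle samples and fit values below are hypothetical, and the tie-breaking rule in `max_mode` (first maximum wins) is one arbitrary choice for the extra decision step the text mentions.

```python
def centroid(thetas, fits):
    """Equation (3): fit-weighted average of the sampled steering angles."""
    total = sum(fits)
    if total == 0:
        raise ValueError("empty output fuzzy set: no FAM rule fired")
    return sum(t * m for t, m in zip(thetas, fits)) / total

def max_mode(thetas, fits):
    """Mode selection; ties are broken by taking the first maximum (assumed)."""
    return thetas[fits.index(max(fits))]

thetas = [-10.0, -5.0, 0.0, 5.0, 10.0]   # hypothetical sampled universe of discourse
fits = [0.0, 0.5, 0.75, 0.75, 0.25]      # hypothetical output set O

theta_centroid = centroid(thetas, fits)
theta_mode = max_mode(thetas, fits)
```

The output set here has two equal modes, so mode selection needs the tie-break, while the centroid uses every fit value and lands between the modes and the heavier side of the set.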

With 35 FAM rules, the fuzzy truck controller produced successful truck backing-up
trajectories starting from any initial position. Figure 7 shows typical examples of the
fuzzy-controlled truck trajectories from different initial positions. The fuzzy control system did
not use (“fire”) all FAM rules at each iteration. Equivalently, most output consequent sets
are empty. In most cases the system used only one or two FAM rules at each iteration.
The system used at most 4 FAM rules at once.

FIGURE 7 Sample truck trajectories of the fuzzy controller for initial
positions (x, y, φ): (a) (20,20,30), (b) (30,10,220), and (c) (30,40,-10).


Neural Truck Backer-Upper System


The neural truck backer-upper of Nguyen and Widrow [1989] consisted of multilayer
feedforward neural networks trained with the backpropagation gradient-descent
(stochastic-approximation) algorithm. The neural control system consisted of two neural networks:
the controller network and the truck emulator network. The controller network produced
an appropriate steering-angle signal output given any parking-lot coordinates (x, y), and
the angle φ. The emulator network computed the next position of the truck. The emulator
network took as input the previous truck position and the current steering-angle output
computed by the controller network.

We did not train the emulator network since we could not obtain “universal” synaptic
connection weights for the truck emulator network. The backpropagation learning
algorithm did not converge for some sets of training samples. The number of training samples
for the emulator network might exceed 3000. For example, the combinations of training
samples of a given angle φ, x-position, y-position, and steering-angle signal θ might
correspond to 3150 (18 × 5 × 5 × 7) samples depending on the division of the input-output
product space. Moreover, the training samples were numerically similar since the neuronal
signals assumed scaled values in [0, 1] or [−1, 1]. For example, we treated close values, such
as 0.40 and 0.41, as distinct sample values.

Simple kinematic equations replaced the truck emulator network. If the truck moved
backward from (x, y) to (x', y') at an iteration, then

$$x' = x + r\cos(\phi') , \tag{4}$$

$$y' = y + r\sin(\phi') , \tag{5}$$

$$\phi' = \phi + \theta . \tag{6}$$

r denotes the fixed driving distance of the truck for all backing movements. We used
equations (4)-(6) instead of the emulator network. This did not affect the post-training
performance of the neural truck backer-upper since the truck emulator network
backpropagated only errors.
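
Equations (4)-(6) translate directly into a one-step update. The sketch below assumes angles in degrees and a driving distance r = 1; the source says only that r is fixed, so that value and the function name are hypothetical.

```python
import math

def back_up_step(x, y, phi, theta, r=1.0):
    """One backing-up stage following equations (4)-(6); angles in degrees."""
    phi_next = phi + theta                                 # equation (6)
    x_next = x + r * math.cos(math.radians(phi_next))      # equation (4)
    y_next = y + r * math.sin(math.radians(phi_next))      # equation (5)
    return x_next, y_next, phi_next

# A truck already aligned with the dock (phi = 90) and steering straight
# keeps its x-position and moves toward y = 100.
x, y, phi = back_up_step(50.0, 20.0, 90.0, 0.0)
```

Note that the position update uses the new angle φ', as in the equations, so the steering signal takes effect within the same stage.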

We trained only the controller network with backpropagation. The controller network
used 24 “hidden” neurons with logistic sigmoid functions. In the training of the truck
controller, we estimated the ideal steering-angle signal at each stage before we trained the
controller network. In the simulation, we used the arc-shaped truck trajectory produced
by the fuzzy controller as the ideal trajectory. The fuzzy controller generated each training
sample (x, y, φ, θ) at each iteration of the backing-up process. We used 35 training sample
vectors and needed more than 100,000 iterations to train the controller network.

Figure  5b  shows  the  resulting  neural  control  surface  for  y  =  20.  The  neural  control 
surface  shows  less  structure  than  the  corresponding  fuzzy  control  surface.  This  reflects 
the  unstructured  nature  of  black-box  supervised  learning.  Figure  8  shows  the  network 
connection  topology  for  our  neural  truck  backer-upper  control  system. 

Figure 9 shows typical examples of the neural-controlled truck trajectories from
several initial positions. Even though we trained the neural network to follow the smooth
arc-shaped path, some learned truck trajectories were non-optimal.

Comparison  of  Fuzzy  and  Neural  Systems 

As shown in Figures 7 and 9, the fuzzy controller always smoothly backed up the truck
but  the  neural  controller  did  not.  The  neural-controlled  truck  sometimes  followed  an 


FIGURE 8 Topology of our neural control system.


FIGURE 9 Sample truck trajectories of the neural controller for initial
positions (x, y, φ): (a) (20,20,30), (b) (30,10,220), and (c) (30,40,-10).


FIGURE 10 The fuzzy truck trajectory after we replaced the key
steady-state FAM rule 18 by the two worst rules: (a) IF x = CE AND φ = VE,
THEN θ = PB, and (b) IF x = CE AND φ = VE, THEN θ = NB.

Training the neural control system was time-consuming. The backpropagation
algorithm required thousands of back-ups to train the controller network. In some cases, the
learning algorithm did not converge.

We  “trained”  the  fuzzy  controller  by  encoding  our  own  common  sense  FAM  rules.  Once 
we  develop  the  FAM-rule  bank,  we  can  compute  control  outputs  from  the  resulting  FAM- 
bank  matrix  or  control  surface.  The  fuzzy  controller  did  not  need  a  truck  emulator  and 
did  not  require  a  math  model  of  how  outputs  depended  on  inputs. 

The  fuzzy  controller  was  computationally  lighter  than  the  neural  controller.  Most 
computation  operations  in  the  neural  controller  involved  the  multiplication,  addition,  or 
logarithm  of  two  real  numbers.  In  the  fuzzy  controller,  most  computational  operations 
involved  comparing  and  adding  two  real  numbers. 


Sensitivity  Analysis 

We studied the sensitivity of the fuzzy controller in two ways. We replaced the FAM
rules with destructive or “sabotage” FAM rules, and we randomly removed FAM rules.

FIGURE 11 Fuzzy truck trajectory when (a) no FAM rules are removed
and (b) FAM rules 7, 13, 18 and 23 are removed.

We  deliberately  chose  sabotage  FAM  rules  to  confound  the  system.  Figure  10  shows  the 
trajectory  when  two  sabotage  FAM  rules  replaced  the  important  steady-state  FAM  rule — 
FAM  rule  18:  the  fuzzy  controller  should  produce  zero  output  when  the  truck  is  nearly  in 
the  correct  parking  position.  Figure  11  shows  the  truck  trajectory  after  we  removed  four 
randomly  chosen  FAM  rules  (7,  13,  18,  and  23).  These  perturbations  did  not  significantly 
affect  the  fuzzy  controller’s  performance. 

We studied robustness of each controller by examining failure rates. For the fuzzy
controller we removed fixed percentages of randomly selected FAM rules from the system.
For the neural controller we removed training data. Figure 12 shows performance errors
averaged over ten typical back-ups with missing FAM rules for the fuzzy controller and
missing training data for the neural controller. The missing FAM rules and training data
ranged from 0 % to 100 % of the total. In Figure 12a, the docking error equaled the
Euclidean distance from the actual final position (φ, x, y) to the desired final position
(φ_f, x_f, y_f):

$$\text{Docking Error} = \sqrt{(\phi_f - \phi)^2 + (x_f - x)^2 + (y_f - y)^2} . \tag{7}$$
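
Equation (7) is a plain Euclidean distance over the triple (φ, x, y). A minimal sketch follows; the default target uses the values stated earlier (φ_f = 90°, x_f = 50, y_f = 100), while the function name and the sample final state are hypothetical.

```python
import math

def docking_error(final, desired=(90.0, 50.0, 100.0)):
    """Equation (7): Euclidean distance between (phi, x, y) triples."""
    return math.sqrt(sum((d - f) ** 2 for f, d in zip(final, desired)))

# Hypothetical final state: 3 degrees off the right angle and 4 units off in x.
error = docking_error((87.0, 54.0, 100.0))
```

Because the formula mixes degrees with position units, the error weights one degree of misalignment the same as one unit of position offset.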

FIGURE 12 Comparison of robustness of the controllers: (a) Docking and
Trajectory error of the fuzzy controller, (b) Docking and Trajectory error of
the neural controller.

In Figure 12b, the trajectory error equaled the ratio of the actual trajectory length of the
truck divided by the straight line distance to the loading dock:

$$\text{Trajectory Error} = \frac{\text{length of truck trajectory}}{\text{distance(initial position, desired final position)}} . \tag{8}$$
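
Equation (8) can be sketched as the length of a piecewise-linear path divided by the straight-line distance from its first point to the dock. The sketch assumes that positions are (x, y) pairs with the angle left out and that the trajectory is a list of sampled points; both sample paths are hypothetical.

```python
import math

def path_length(points):
    """Length of the piecewise-linear path through the sampled (x, y) points."""
    return sum(math.dist(p, q) for p, q in zip(points, points[1:]))

def trajectory_error(points, dock=(50.0, 100.0)):
    """Equation (8): path length over straight-line distance to the dock."""
    return path_length(points) / math.dist(points[0], dock)

straight = [(50.0, 20.0), (50.0, 60.0), (50.0, 100.0)]   # hypothetical direct path
detour = [(50.0, 20.0), (80.0, 60.0), (50.0, 100.0)]     # hypothetical longer path

ratio_straight = trajectory_error(straight)
ratio_detour = trajectory_error(detour)
```

The ratio is 1 for a perfectly straight approach and grows as the path wanders, which matches the stated preference for short smooth trajectories.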


Adaptive  Fuzzy  Truck  Backer-Upper 


Adaptive FAM (AFAM) systems generate FAM rules directly from training data. A
one-dimensional FAM system, S: I^n → I^p, defines a FAM rule, a single association of the
form (A_i, B_i). In this case the input-output product space equals I^n × I^p. As discussed in
[Kosko, 1990a], a FAM rule (A_i, B_i) defines a cluster or ball of points in the product-space
cube I^n × I^p centered at the point (A_i, B_i). Adaptive clustering algorithms can estimate the
unknown FAM rule (A_i, B_i) from training samples in R^2. We used differential competitive
learning (DCL) to recover the bank of FAM rules that generated the truck training data.

We generated 2230 truck samples from 7 different initial positions and varying angles. We chose the initial positions (20,20), (30,20), (45,20), (50,20), (55,20), (70,20), and (80,20). We changed the angle from −60° to 240° at each initial position. At each step, the fuzzy controller produced output steering angle $\theta$. The training vectors $(x, \phi, \theta)$ defined points in a three-dimensional product-space. The variable $x$ had 5 fuzzy-set values: LE, LC, CE, RC, and RI. $\phi$ had 7 fuzzy-set values: RB, RU, RV, VE, LV, LU, and LB. $\theta$ had 7 fuzzy-set values: NB, NM, NS, ZE, PS, PM, and PB. So there were 245 (5 × 7 × 7) possible FAM cells.

We  defined  FAM  cells  by  partitioning  the  effective  product-space.  FAM  cells  near  the 
center  were  smaller  than  outer  FAM  cells  because  we  chose  narrow  membership  functions 
near  the  steady-state  FAM  cell.  Uniform  partitions  of  the  product-space  produced  poor 
estimates  of  the  original  FAM  rules.  As  in  Figure  3,  this  reflected  the  need  to  judiciously 
define  the  fuzzy-set  values  of  the  system  fuzzy  variables. 

We performed product-space clustering with the version of DCL discussed in [Kosko, 1990a]. If a FAM cell contained at least one of the 245 synaptic quantization vectors, we entered the corresponding FAM rule in the FAM matrix.

Figure 13a shows the input sample distribution of $(x, \phi)$. We did not include the variable $\theta$ in the figure. Training data clustered near the steady-state position ($x = 50$ and $\phi = 90°$). Figure 13b displays the synaptic-vector histogram after DCL classified 2230 training vectors for 35 FAM rules. Since a successful FAM system generated the training samples, most training samples, and thus most synaptic vectors, clustered in the steady-state FAM cell.

DCL  product-space  clustering  estimated  35  new  FAM  rules.  Figure  14  shows  the  DCL- 
estimated  FAM  bank  and  the  corresponding  control  surface.  The  DCL-estimated  control 
surface  visually  resembles  the  underlying  unknown  control  surface  in  Figure  5a.  The  two 
systems  produce  nearly  equivalent  truck-backing  behavior.  This  suggests  adaptive  product- 
space  clustering  can  estimate  the  FAM  rules  underlying  expert  behavior  in  many  cases, 
even when the expert or fuzzy engineer cannot articulate the FAM rules.

FIGURE 13 (a) Input data distribution. (b) Synaptic-vector histogram. Differential competitive learning allocated synaptic quantization vectors to FAM cells. The steady-state FAM cell (CE, VE; ZE) contained the most synaptic vectors.

FIGURE 14 (a) DCL-estimated FAM bank. (b) Corresponding control surface.

FIGURE 15 (a) FAM bank generated by the neural control surface in Figure 5b. (b) Control surface of the neural BP-AFAM system in (a).


We  also  used  the  neural  control  surface  in  Figure  5b  to  estimate  FAM  rules.  We  divided 
the  input-output  product-space  into  FAM  cells  as  in  the  fuzzy  control  case.  If  the  neural 
control  surface  intersected  the  FAM  cell,  we  entered  the  corresponding  FAM  rule  in  a  FAM 
bank. We averaged all neural control-surface values in a square region over the two input
variables $x$ and $\phi$. We assigned the average value to one of 7 output fuzzy sets. Figure 15
shows  the  resulting  FAM  bank  and  corresponding  control  surface  generated  by  the  neural 
control  surface  in  Figure  5b.  This  new  control  surface  resembles  the  original  fuzzy  control 
surface  in  Figure  5a  more  than  it  resembles  the  neural  control  surface  in  Figure  5b.  Note 
the  absence  of  a  steady-state  FAM  rule  in  the  FAM  matrix  in  Figure  5a. 

Figure  16  compares  the  DCL-AFAM  and  BP-AFAM  control  surfaces  with  the  fuzzy 
control  surface  in  Figure  5a.  Figure  16  shows  the  absolute  difference  of  the  control  surfaces. 
As  expected,  the  DCL-AFAM  system  produced  less  absolute  error  than  the  BP-AFAM 
system  produced. 

Figure 17 shows the docking and trajectory errors of the two AFAM control systems. The DCL-AFAM system produced less docking error than the BP-AFAM system produced for 100 arbitrary backing-up trials. The two AFAM systems generated similar backing-up trajectories. This suggests that black-box neural estimators can define the front-end of FAM-structured systems. In principle we can use this technique to generate structured FAM rules for any neural application. We can then inspect and refine these rules and perhaps replace the original neural system with the tuned FAM system.

FIGURE 16 (a) Absolute difference of the FAM surface in Figure 5a and the DCL-estimated FAM surface in Figure 14b. (b) Absolute difference of the FAM surface in Figure 5a and the neural-estimated FAM surface in Figure 15b.

Fuzzy  Truck-and-Trailer  Controller 

We added a trailer to the truck system, as in the original Nguyen-Widrow model. Figure 18 shows the simulated truck-and-trailer system. We added one more variable (cab angle, $\phi_c$) to the three state variables of the trailerless truck. In this case a FAM rule takes the form

IF $x$ = LE AND $\phi_t$ = RB AND $\phi_c$ = PO, THEN $\beta$ = NS.

The four state variables $x$, $y$, $\phi_t$, and $\phi_c$ determined the position of the truck-and-trailer system in the plane. Fuzzy variable $\phi_t$ corresponded to $\phi$ for the trailerless truck. Fuzzy variable $\phi_c$ specified the relative cab angle with respect to the center line along the trailer. $\phi_c$ ranged from −90° to 90°. The extreme cab angles 90° and −90° corresponded to two “jackknife” positions of the cab with respect to the trailer. A positive $\phi_c$ value indicated that the cab resided on the left-hand side of the trailer. A negative value indicated that it resided on the right-hand side. Figure 18 shows a positive angle value of $\phi_c$.

FIGURE 17 (a) Docking errors and (b) Trajectory errors of the DCL-AFAM and BP-AFAM control systems. Legend values recovered from the plots, in their original order:
(a) Docking Error
- BP-AFAM (dashed): mean = 6.6863, s.d. = 1.0665
- DCL-AFAM (solid): mean = 1.1075, s.d. = 0.0839
- BP-AFAM (dashed): mean = 1.1453, s.d. = 0.1016

FIGURE 18 Diagram of the simulated truck-and-trailer system.
$(x, y)$: Cartesian coordinate of the rear end, (0,100).
$(u, v)$: Cartesian coordinate of the joint.
$\phi_t$: Angle of the trailer with horizontal, (0,360).
$\phi_c$: Relative angle of the cab with trailer, (-90,90).
$\theta$: Steering angle, (-30,30).
$\beta$: Angle of the trailer updated at each step, (-30,30).

Fuzzy variables $x$, $\phi_t$, and $\phi_c$ defined the input variables. Fuzzy variable $\beta$ defined the output variable. $\beta$ measured the angle that we needed to update the trailer at each iteration. We computed the steering-angle output $\theta$ with the following geometric relationship. With the output $\beta$ value computed, the trailer position $(x, y)$ moved to the new position $(x', y')$:

$$x' = x + r \cos(\phi_t + \beta) , \qquad (9)$$

$$y' = y + r \sin(\phi_t + \beta) , \qquad (10)$$

where $r$ denotes a fixed backing distance. Then the joint of the cab and the trailer $(u, v)$ moved to the new position $(u', v')$:

$$u' = x - l \cos(\phi_t + \beta) , \qquad (11)$$

$$v' = y - l \sin(\phi_t + \beta) , \qquad (12)$$

where $l$ denotes the trailer length. We updated the directional vector $(dirU, dirV)$, which defined the cab angle, by

$$dirU' = dirU + \Delta u , \qquad (13)$$

$$dirV' = dirV + \Delta v , \qquad (14)$$

where $\Delta u = u' - u$ and $\Delta v = v' - v$. The new directional vector $(dirU', dirV')$ defines the new cab angle $\phi'_c$. Then we obtain the steering angle value as $\theta = \phi'_{c,h} - \phi_{c,h}$, where $\phi_{c,h}$ denotes the cab angle with the horizontal. We chose the same fuzzy-set values and membership functions for $\beta$ as we chose for $\theta$. $\beta$ ranged from −30° to 30°. We chose the fuzzy-set values of $\phi_c$ as NE, ZR, and PO as in Figure 19.

FIGURE 19 Membership graphs of the three fuzzy-set values of fuzzy variable $\phi_c$.
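The geometric update in equations (9)-(14) can be sketched as a single Python function. This is a minimal sketch under stated assumptions: angles are in degrees, the cab angle with the horizontal is taken as `atan2(dirV, dirU)` (the text does not say how the angle is read off the directional vector), and the default backing distance `r` and trailer length `trailer_len` are placeholder values, not the values used in the simulations. Equations (11) and (12) are coded with the unprimed $x$ and $y$, as they appear in the text.

```python
import math

def back_up_step(x, y, phi_t, beta, u, v, dir_u, dir_v, r=1.0, trailer_len=10.0):
    """One backing-up step of the truck-and-trailer, following (9)-(14).

    Returns (x', y', u', v', dirU', dirV', theta) with theta in degrees.
    r and trailer_len are hypothetical placeholder values.
    """
    a = math.radians(phi_t + beta)
    x_new = x + r * math.cos(a)              # (9)
    y_new = y + r * math.sin(a)              # (10)
    u_new = x - trailer_len * math.cos(a)    # (11), unprimed x as in the text
    v_new = y - trailer_len * math.sin(a)    # (12), unprimed y as in the text
    dir_u_new = dir_u + (u_new - u)          # (13)
    dir_v_new = dir_v + (v_new - v)          # (14)
    # Assumption: cab angle with the horizontal = atan2 of the directional vector.
    old_angle = math.degrees(math.atan2(dir_v, dir_u))
    new_angle = math.degrees(math.atan2(dir_v_new, dir_u_new))
    # Steering angle = change in cab angle, wrapped into [-180, 180).
    theta = (new_angle - old_angle + 180.0) % 360.0 - 180.0
    return x_new, y_new, u_new, v_new, dir_u_new, dir_v_new, theta
```

Calling the function repeatedly with the controller's output $\beta$ at each step traces out one backing-up trajectory.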

Figure 20 displays the 5 FAM-rule matrices in the FAM bank of the fuzzy truck-and-trailer system. In Figure 20 we fixed the fuzzy variable $x$ as LE, LC, CE, RC, and RI. There were 735 (7 × 5 × 7 × 3) possible FAM rules and only 105 actual FAM rules.

Figure 21 shows typical backing-up trajectories of the fuzzy truck-and-trailer control system from different initial positions. The truck-and-trailer backed up in different directions depending on the relative position of the cab with respect to the trailer. The fuzzy control systems successfully controlled the truck-and-trailer in jackknife positions.

FIGURE 21 Sample truck-and-trailer trajectories from the fuzzy controller for initial positions $(x, y, \phi_t, \phi_c)$: (a) (25, 30, −20, 30), (b) (80, 30, 210, −40), and (c) (70, 30, 200, 30).


BP Truck-and-Trailer Control Systems

We added the cab-angle variable $\phi_c$ as an input to the backpropagation-trained neural truck controller. The controller network contained 24 hidden neurons with output variable $\beta$. The training samples were 5-dimensional vectors of the form $(x, y, \phi_t, \phi_c, \beta)$. We trained the controller network with 52 training samples from the fuzzy controller: 26 samples for the left half of the plane, 26 samples for the right half of the plane. We used equations (9)-(14) instead of the emulator network. Training required more than 200,000 iterations. Some training sequences did not converge. The BP-trained controller performed well except in a few cases. Figure 22 shows typical backing-up trajectories of the BP truck-and-trailer control system from the same initial positions used in Figure 21.

We performed the same robustness tests for the fuzzy and BP-trained truck-and-trailer controllers as in the trailerless truck case. Figure 23 shows performance errors averaged over ten typical back-ups from ten different initial positions. These performance graphs closely resemble the performance graphs for the trailerless truck systems in Figure 12.

FIGURE 22 Sample truck-and-trailer trajectories of the BP-trained controller for initial positions $(x, y, \phi_t, \phi_c)$: (a) (25, 30, −20, 30), (b) (80, 30, 210, −40), and (c) (70, 30, 200, 30).


AFAM  Truck-and-Trailer  Control  Systems 

We generated 6250 truck-and-trailer data vectors using the original FAM system in Figure 20. We backed up the truck-and-trailer from the same initial positions as in the trailerless truck case. The trailer angle $\phi_t$ ranged from −60° to 240°, and the cab angle $\phi_c$ assumed only the three values −45°, 0°, and 45°. The training vectors $(x, \phi_t, \phi_c, \beta)$ defined points in the four-dimensional input-output product-space. We nonuniformly partitioned the product space into FAM cells to allow narrower fuzzy-set values near the steady-state FAM cell.

We used DCL to train the AFAM truck-and-trailer controller. The total number of FAM cells equaled 735 (7 × 5 × 7 × 3). We used 735 synaptic quantization vectors. The DCL algorithm classified the 6250 data vectors into 105 FAM cells. Figure 24 shows the synaptic-vector histogram corresponding to the 105 FAM rules. Figure 25 shows the FAM bank estimated by the DCL algorithm. Figure 26 shows the original and DCL-estimated control surfaces for the fuzzy truck-and-trailer systems.

FIGURE 23 Comparison of robustness of the two truck-and-trailer controllers: (a) Docking and trajectory error of the fuzzy controller. (b) Docking and trajectory error of the BP controller (axis label: % training samples removed).

FIGURE 24 Synaptic-vector histogram for the AFAM truck-and-trailer system (axis label: % classification).

FIGURE 25 DCL-estimated FAM bank for the AFAM truck-and-trailer system.


Figure 27 shows the trajectories of the original FAM and the DCL-estimated AFAM truck-and-trailer controllers. Figures 27a and 27b show the two trajectories from the initial position $(x, y, \phi_t, \phi_c)$ = (30,30,10,45). Figures 27c and 27d show the trajectories from initial position (60,30,210,−60). The original FAM and DCL-estimated AFAM systems exhibited comparable truck-and-trailer control performance except in a few cases, where the DCL-estimated AFAM trajectories were irregular.

FIGURE 26 (a) Original control surfaces for the truck-and-trailer system. (b) DCL-estimated control surfaces for the truck-and-trailer system. (Surviving panel titles: $x$ = RC, $x$ = RI.)

FIGURE 27 Sample truck-and-trailer trajectories from the original and the DCL-estimated FAM systems starting at initial positions $(x, y, \phi_t, \phi_c)$ = (30,30,10,45) and (60,30,210,−60).

Conclusion 

We  quickly  engineered  fuzzy  systems  to  successfully  back  up  a  truck  and  truck-and- 
trailer  system  in  a  parking  lot.  We  used  only  common  sense  and  error-nulling  intuitions 
to generate sufficient banks of FAM rules. These systems performed well until we removed
over  50  %  of  the  FAM  rules.  This  extreme  robustness  suggests  that,  for  many  estimation 
and  control  problems,  different  fuzzy  engineers  can  rapidly  develop  prototype  fuzzy  systems 
that  perform  similarly  and  well. 

The speed with which the DCL clustering technique recovers the underlying FAM bank further suggests that we can likewise construct fuzzy systems for more complex, higher-dimensional problems. For these problems we may have access to only incomplete numerical input-output data. Pure neural-network or statistical-process-control approaches may generate systems with comparable performance. But these systems will involve far greater computational effort, will be more difficult to modify, and will not provide a structured representation of the system's throughput.

Our neural experiments suggest that whenever we model a system with a neural network, for little extra computational cost we can generate a set of structured FAM rules that approximate the neural system's behavior. We can then tune the fuzzy system by refining the FAM-rule bank with fuzzy-engineering rules of thumb and with further training data.


APPENDIX: Product-space Clustering with Differential Competitive Learning


Product-space clustering [Kosko, 1990a] is a form of stochastic adaptive vector quantization. Adaptive vector quantization (AVQ) systems adaptively quantize pattern clusters in $R^n$. Stochastic competitive learning systems are neural AVQ systems. Neurons compete for the activation induced by randomly sampled patterns. The corresponding synaptic fan-in vectors adaptively quantize the pattern space $R^n$. The $p$ synaptic vectors define the $p$ columns of the synaptic connection matrix $M$. $M$ interconnects the $n$ input or linear neurons in the input neuronal field $F_X$ to the $p$ competing nonlinear neurons in the output field $F_Y$. Figure 28 below illustrates the neural network topology.

Learning algorithms estimate the unknown probability density function $p(x)$, which describes the distribution of patterns in $R^n$. More synaptic vectors arrive at more probable regions. Where sample vectors $x$ are dense or sparse, synaptic vectors should be dense or sparse. The local count of synaptic vectors then gives a nonparametric estimate of the volume probability $P(V)$ for volume $V \subset R^n$:

$$P(V) = \int_V p(x) \, dx \qquad (15)$$

$$\approx \frac{\text{Number of } m_i \in V}{p} . \qquad (16)$$

In the extreme case that $V = R^n$, this approximation gives $P(V) = p/p = 1$. For improbable subsets $V$, $P(V) = 0/p = 0$.


Stochastic Competitive Learning Algorithms


The metaphor of competing neurons reduces to nearest-neighbor classification. The AVQ system compares the current vector random sample $x(t)$ in Euclidean distance to the $p$ columns of the synaptic connection matrix $M$, to the $p$ synaptic vectors $m_1(t), \ldots, m_p(t)$. If the $j$th synaptic vector $m_j(t)$ is closest to $x(t)$, then the $j$th output neuron “wins” the competition for activation at time $t$. In practice we sometimes define the nearest $N$ synaptic vectors as winners. Some scaled form of $x(t) - m_j(t)$ updates the nearest or “winning” synaptic vectors. “Losers” remain unchanged: $m_i(t+1) = m_i(t)$. Competitive synaptic vectors converge to pattern-class centroids exponentially fast [Kosko, 1990b].

The following three-step process describes the competitive AVQ algorithm, where the third step depends on which learning algorithm updates the winning synaptic vectors.


Competitive AVQ Algorithm


1. Initialize synaptic vectors: $m_i(0) = x(i)$, $i = 1, \ldots, p$.

Sample-dependent initialization avoids many pathologies that can distort nearest-neighbor learning.

2. For random sample $x(t)$, find the closest or “winning” synaptic vector $m_j(t)$:

$$\| m_j(t) - x(t) \| = \min_i \| m_i(t) - x(t) \| , \qquad (17)$$

where $\|x\|^2 = x_1^2 + \cdots + x_n^2$ defines the squared Euclidean vector norm of $x$. We can define the $N$ synaptic vectors closest to $x$ as “winners”.

3. Update the winning synaptic vector(s) $m_j(t)$ with an appropriate learning algorithm.


Differential competitive learning (DCL)


Differential competitive “synapses” learn only if the competing “neuron” changes its competitive status [Kosko, 1990c]:

$$\dot{m}_{ij} = \dot{S}_j(y_j) \, [\, S_i(x_i) - m_{ij} \,] , \qquad (18)$$

or in vector notation,

$$\dot{m}_j = \dot{S}_j(y_j) \, [\, S(x) - m_j \,] , \qquad (19)$$

where $S(x) = (S_1(x_1), \ldots, S_n(x_n))$ and $m_j = (m_{1j}, \ldots, m_{nj})$. $m_{ij}$ denotes the synaptic weight between the $i$th neuron in input field $F_X$ and the $j$th neuron in competitive field $F_Y$. Nonnegative signal functions $S_i$ and $S_j$ transduce the real-valued activations $x_i$ and $y_j$ into bounded monotone nondecreasing signals $S_i(x_i)$ and $S_j(y_j)$. $\dot{m}_{ij}$ and $\dot{S}_j(y_j)$ denote the time derivatives of $m_{ij}$ and $S_j(y_j)$, synaptic and signal velocities. $S_j(y_j)$ measures the competitive status of the $j$th competing neuron in $F_Y$. Usually $S_j$ approximates a binary threshold function. For example, $S_j$ may equal a steep binary logistic sigmoid,

$$S_j(y_j) = \frac{1}{1 + e^{-c y_j}} , \qquad (20)$$

for some constant $c > 0$. The $j$th neuron wins the laterally inhibitive competition if $S_j = 1$, loses if $S_j = 0$.

For discrete implementation, we use the DCL algorithm as a stochastic difference equation [Kong, 1991]:

$$m_j(t+1) = m_j(t) + c_t \, \Delta S_j(y_j(t)) \, [\, S(x(t)) - m_j(t) \,] \quad \text{if the $j$th neuron wins,} \qquad (21)$$

$$m_i(t+1) = m_i(t) \quad \text{if the $i$th neuron loses.} \qquad (22)$$

$\Delta S_j(y_j(t))$ denotes the time change of the $j$th neuron's competition signal $S_j(y_j)$ in the competitive field $F_Y$:

$$\Delta S_j(y_j(t)) = \mathrm{sgn}[\, S_j(y_j(t+1)) - S_j(y_j(t)) \,] . \qquad (23)$$

We define the signum operator $\mathrm{sgn}(x)$ as

$$\mathrm{sgn}(x) = \begin{cases} 1 & \text{if } x > 0 \\ 0 & \text{if } x = 0 \\ -1 & \text{if } x < 0 \end{cases} . \qquad (24)$$

$\{c_t\}$ denotes a slowly decreasing sequence of learning coefficients, such as $c_t = 0.1 \, (1 - t/2000)$ for 2000 training samples. Stochastic approximation [Huber, 1981] requires a decreasing gain sequence $\{c_t\}$ to suppress random disturbances and to guarantee convergence to local minima of mean-squared performance measures. The learning coefficients should decrease slowly,

$$\sum_{t=1}^{\infty} c_t = \infty , \qquad (25)$$

but not too slowly,

$$\sum_{t=1}^{\infty} c_t^2 < \infty . \qquad (26)$$

Harmonic-series coefficients, $c_t = 1/t$, satisfy these constraints.
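As a quick check (a standard calculus fact, added here only for illustration), the harmonic gains $c_t = 1/t$ have a divergent sum but a convergent sum of squares, which is what the two stochastic-approximation conditions ask for:

```latex
\sum_{t=1}^{\infty} \frac{1}{t} = \infty ,
\qquad
\sum_{t=1}^{\infty} \frac{1}{t^{2}} = \frac{\pi^{2}}{6} < \infty .
```

By contrast, the linear schedule $c_t = 0.1 \, (1 - t/2000)$ is defined only over the finite training run, where it falls from about 0.1 toward 0.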

We approximate the competitive signal difference $\Delta S_j$ as the activation difference $\Delta y_j$:

$$\Delta S_j(y_j(t)) = \mathrm{sgn}[\, y_j(t+1) - y_j(t) \,] \qquad (27)$$

$$= \Delta y_j(t) . \qquad (28)$$

Input neurons in feedforward networks usually behave linearly: $S_i(x_i) = x_i$, or $S(x(t)) = x(t)$. Then we update the winning synaptic vector $m_j(t)$ with

$$m_j(t+1) = m_j(t) + c_t \, \Delta y_j(t) \, [\, x(t) - m_j(t) \,] . \qquad (29)$$

We update the $F_Y$ neuronal activations $y_j$ with the additive model

$$y_j(t+1) = y_j(t) + \sum_i S_i(x_i(t)) \, m_{ij}(t) + \sum_k S_k(y_k(t)) \, w_{kj} . \qquad (30)$$

FIGURE 28 Topology of the laterally inhibitive DCL network, connecting the input field $F_X$ to the competition field $F_Y$.

For linear signal functions the first sum in (30) reduces to an inner product of sample and synaptic vectors:

$$\sum_i x_i(t) \, m_{ij}(t) = x^T(t) \, m_j(t) . \qquad (31)$$

Then positive learning tends to occur ($\Delta m_{ij} > 0$) when $x$ is close to the $j$th synaptic vector $m_j$.

Since a binary threshold function approximates the output signal function $S_k(y_k)$, the second sum in (30) sums over just the winning neurons: $\sum_k w_{kj}$ for winning neurons $y_k$.

The $p \times p$ matrix $W$ contains the $F_Y$ within-field synaptic connection strengths. Diagonal elements $w_{ii}$ are positive, off-diagonal elements negative. Winning neurons excite themselves and inhibit all other neurons. Figure 28 shows the connection topology of the laterally inhibitive DCL network.
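One iteration of the competitive AVQ algorithm with the discrete DCL update (29) and the additive activation model (30) can be sketched as follows. This is a minimal sketch, not the original simulation code: the function name `dcl_step`, the use of NumPy, a single winner per sample, and a hard 0/1 threshold for the competitive signals are all assumptions made for illustration.

```python
import numpy as np

def dcl_step(M, y, x, W, c_t):
    """One discrete DCL update for a single sample x.

    M   : (n, p) array whose columns are the synaptic vectors m_j
    y   : (p,) array of competitive-field activations y_j(t)
    x   : (n,) sample vector
    W   : (p, p) within-field matrix (positive diagonal, negative off-diagonal)
    c_t : learning coefficient at time t
    """
    # Step 2 of the AVQ algorithm: the nearest synaptic vector wins, as in (17).
    j = int(np.argmin(np.linalg.norm(M - x[:, None], axis=0)))
    # Assumed binary-threshold signals: 1 for the winner, 0 for the losers.
    S = np.zeros(M.shape[1])
    S[j] = 1.0
    # Additive model (30) with linear input signals S_i(x_i) = x_i.
    y_next = y + x @ M + S @ W
    # Activation difference (27)-(28) for the winning neuron.
    dy = np.sign(y_next[j] - y[j])
    # Winner update (29); losing columns stay unchanged, as in (22).
    M_next = M.copy()
    M_next[:, j] += c_t * dy * (x - M[:, j])
    return M_next, y_next, j
```

Looping `dcl_step` over the training samples with a decreasing `c_t` gives the full training pass. A negative activation difference moves the winner away from the sample, which is the "punish" half of the reinforcement behavior.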
Product-space clustering


We divided the space $0 < x < 100$ into five nonuniform intervals [0,32.5], [32.5,47.5], [47.5,52.5], [52.5,67.5], and [67.5,100]. The intervals represented, respectively, the five fuzzy-set values LE, LC, CE, RC, and RI. This choice corresponded to the nonoverlapping intervals of the fuzzy membership function graphs $m(x)$ in Figure 3. Similarly, we divided the space $-90 < \phi < 270$ into seven nonuniform intervals [−90,0], [0,66.5], [66.5,86], [86,94], [94,113.5], [113.5,182.5], and [182.5,270], which corresponded respectively to RB, RU, RV, VE, LV, LU, and LB. We divided the space $-30 < \theta < 30$ into seven nonuniform intervals [−30,−20], [−20,−7.5], [−7.5,−2.5], [−2.5,2.5], [2.5,7.5], [7.5,20], and [20,30], which corresponded to NB, NM, NS, ZE, PS, PM, and PB.

DCL  classified  each  input-output  data  vector  into  one  of  the  FAM  cells.  We  added  a 
FAM  rule  to  the  FAM  bank  if  the  DCL-trained  synaptic  vector  fell  in  the  FAM  cell.  In 
case  of  ties  we  chose  the  FAM  cell  with  the  most  densely  clustered  data. 
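The mapping from trained synaptic vectors to FAM rules can be sketched with the interval boundaries listed above. Several details here are assumptions rather than statements from the text: a value on a shared boundary is assigned to the upper interval, a "tie" is read as several output cells occupied for the same input pair $(x, \phi)$, and the count of synaptic vectors per cell stands in for "most densely clustered data".

```python
from bisect import bisect_right
from collections import Counter, defaultdict

X_EDGES = [0, 32.5, 47.5, 52.5, 67.5, 100]
X_LABELS = ["LE", "LC", "CE", "RC", "RI"]
PHI_EDGES = [-90, 0, 66.5, 86, 94, 113.5, 182.5, 270]
PHI_LABELS = ["RB", "RU", "RV", "VE", "LV", "LU", "LB"]
THETA_EDGES = [-30, -20, -7.5, -2.5, 2.5, 7.5, 20, 30]
THETA_LABELS = ["NB", "NM", "NS", "ZE", "PS", "PM", "PB"]

def fuzzy_label(value, edges, labels):
    """Return the fuzzy-set label of the interval that contains value."""
    i = bisect_right(edges, value) - 1
    i = min(max(i, 0), len(labels) - 1)   # clamp the two outer endpoints
    return labels[i]

def fam_cell(m):
    """Map a synaptic vector (x, phi, theta) to its FAM cell."""
    x, phi, theta = m
    return (fuzzy_label(x, X_EDGES, X_LABELS),
            fuzzy_label(phi, PHI_EDGES, PHI_LABELS),
            fuzzy_label(theta, THETA_EDGES, THETA_LABELS))

def fam_bank(synaptic_vectors):
    """Build {(x label, phi label): theta label} from occupied FAM cells."""
    counts = defaultdict(Counter)
    for m in synaptic_vectors:
        x_lab, phi_lab, theta_lab = fam_cell(m)
        counts[(x_lab, phi_lab)][theta_lab] += 1
    # Assumed tie rule: keep the output cell holding the most synaptic vectors.
    return {inp: c.most_common(1)[0][0] for inp, c in counts.items()}
```

For example, a synaptic vector at the steady-state position (50, 90, 0) falls in the cell (CE, VE; ZE).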

For  the  BP-AFAM  generated  from  the  neural  control  surface  in  Figure  15,  we  divided 
the  rectangle  [0,100]  x  [—90,270]  into  35  nonuniform  squares  with  the  same  divisions 
defined  above.  Then  we  added  and  averaged  the  control  surface  values  in  the  square.  We 
added  a  FAM  rule  to  the  FAM  bank  if  the  averaged  value  corresponded  to  one  of  the  seven 
FAM  cells. 

For the truck-and-trailer case, we divided the space $-90 < \phi_c < 90$ into three intervals [−90,−12.5], [−12.5,12.5], and [12.5,90], which corresponded to NE, ZR, and PO. There were 735 FAM cells, and 735 possible FAM rules, of the form $(x, \phi_t, \phi_c; \beta)$.


Differential Competitive Learning for Centroid Estimation and Phoneme Recognition


SEONG-GON KONG and BART KOSKO


NN/December 1990 IEEE


Abstract: We compared a differential competitive learning (DCL) system with two supervised competitive learning (SCL) systems for centroid estimation and phoneme recognition. DCL provides a new form of unsupervised adaptive vector quantization. Standard stochastic competitive learning systems learn only if neurons win a competition for activation induced by randomly sampled patterns. DCL systems learn only if the competing neurons change their competitive signal. Signal-velocity information provides unsupervised local reinforcement during learning. The sign of the neuronal signal derivative rewards winners and punishes losers. Standard competitive learning ignores instantaneous win-rate information. Synaptic fan-in vectors adaptively quantize the randomly sampled pattern space into nearest-neighbor decision classes. More generally, the synaptic vector distribution estimates the unknown sampled probability density function $p(x)$. Simulations showed that unsupervised DCL-trained synaptic vectors converged to class centroids at least as fast as, and wandered less about these centroids than, SCL-trained synaptic vectors. Simulations on a small set of English phonemes favored DCL over SCL for classification accuracy.


I.  Adaptive  Vector  Quantization  for  Phoneme 
Recognition 

PHONEME recognition is a simple form of speech recognition. We can recognize a speech sample phoneme by phoneme. The phoneme recognition system learns only a comparatively small set of minimal syllables or phonemes. More advanced systems learn and recognize words, phrases, or sentences. There are orders of magnitude more such speech units than phonemes. Words and phrases can also undergo more complex forms of distortion and time warping.

In principle, we can recognize phonemes and speech with vector quantization methods. These methods search for a small but representative set of prototypes, which we can then use to match sample patterns with nearest-neighbor techniques.

In neural network phoneme recognition, a sequence of discrete phonemes from a continuous speech sample produces a series of neuronal responses. Kohonen's [4] supervised neural phoneme recognition system successfully classifies 21 Finnish phonemes. This stochastic competitive learning system behaves as an adaptive vector quantization system.

Traditional vector quantization systems may attempt to minimize a mean-squared-error or entropic performance measure. Formal minimization assumes knowledge of the sampled probability density function $p(x)$ and perhaps additional knowledge of how some parameters functionally depend on other parameters. $p(x)$ describes the continuous distribution of patterns in $R^n$. In general, we do not know this probabilistic information. Instead, we use learning algorithms to adaptively estimate $p(x)$ from sample realizations. This procedure often reduces to stochastic approximation [11], [12].

Adaptive vector quantization (AVQ) systems adaptively quantize pattern clusters in $R^n$. Stochastic competitive learning systems are neural AVQ systems. Neurons compete for the activation induced by randomly sampled patterns. The corresponding synaptic fan-in vectors adaptively quantize the pattern space $R^n$. The $p$ synaptic vectors $m_j$ define the $p$ columns of the synaptic connection matrix $M$. $M$ interconnects the $n$ input or linear neurons in the input neuronal field $F_X$ to the $p$ competing nonlinear neurons in the output field $F_Y$.

In the simplest case, the $p$ synaptic vectors estimate centroids or modes of the sampled probability density function $p(x)$. The estimates are nonparametric. The user need not know or assume which probability density function $p(x)$ generates the training samples, the observed realizations of the underlying stochastic pattern process.

Pattern learning is supervised if the system uses pattern-class information. Suppose the $k$ decision classes $\{D_j\}$ partition the pattern space $R^n$:

$$R^n = \bigcup_{j=1}^{k} D_j \quad \text{and} \quad D_i \cap D_j = \emptyset \ \text{ if } i \ne j . \qquad (1)$$

The system knows and uses the class membership of each pattern $x$. The system knows that $x \in D_i$ and that $x \notin D_j$ for all $j \ne i$. Pattern learning is unsupervised if the system does not know or use class membership information. Unsupervised learning algorithms use unlabeled pattern samples.

Formally  supervised  learning  depends  on  class  indica¬ 
tor  functions  {  /„  }r 


Mx)  = 


I 

0 


if  x  e  D, 
if  x  £  If. 


(2) 


$I_{D_i}$ indicates whether pattern $x$ belongs to decision class
$D_i$. Unsupervised learning algorithms blindly cluster samples. They
do not depend on class indicator functions. The random indicator
functions define the class probabilities $P(D_1), \ldots, P(D_k)$,
since

\begin{align}
P(D_i) &= \int_{D_i} p(x)\,dx \tag{3} \\
       &= \int_{R^n} I_{D_i}(x)\,p(x)\,dx \tag{4} \\
       &= E[I_{D_i}]. \tag{5}
\end{align}

$E[x]$ denotes the mathematical expectation of scalar random variable
$x$. The partition property and $P(R^n) = 1$ imply
$P(D_1) + \cdots + P(D_k) = 1$.
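Relation (5) suggests a simple empirical check: the sample mean of a
class indicator estimates the class probability. The sketch below is
only an illustration under stated assumptions. The uniform density,
the two-interval partition, and the names `indicator` and
`estimate_class_probability` are hypothetical and do not come from the
simulations reported here.

```python
import random

def indicator(x, in_class):
    """Class indicator I_D(x): 1 if x belongs to class D, else 0."""
    return 1 if in_class(x) else 0

def estimate_class_probability(samples, in_class):
    """Estimate P(D) = E[I_D] as the sample mean of the indicator."""
    return sum(indicator(x, in_class) for x in samples) / len(samples)

# Hypothetical example: x ~ Uniform(0, 1) with the partition
# D_1 = [0, 0.3) and D_2 = [0.3, 1].
rng = random.Random(0)
samples = [rng.random() for _ in range(20000)]
p1 = estimate_class_probability(samples, lambda x: x < 0.3)
p2 = estimate_class_probability(samples, lambda x: x >= 0.3)
```

Because the two classes partition the sample space, the two estimates
sum to one, which mirrors the partition property stated above.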


Learning algorithms estimate the unknown probability density function
$p(x)$. We need not learn if we know $p(x)$. Instead, we could compute
the desired quantities with optimization, numerical-analytical, or
calculus-of-variation techniques. For instance, we could directly
compute the centroids $\bar{x}_j$ of the pattern classes $D_j$. The
centroids minimize the total mean-squared error of vector quantization
$\mathcal{E}$:

\[ \mathcal{E} = \frac{1}{2} \sum_{j=1}^{k} \int_{D_j} \sum_{i=1}^{n} (x_i - m_{ij})^2 \, p(x)\,dx. \tag{6} \]

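For if we set the gradient vector $\nabla_{m_j}\mathcal{E}$ equal to
the null vector and solve for the optimal $m_j$, we get (up to an
overall sign, which does not affect the zero condition)

\begin{align}
0 &= \nabla_{m_j}\mathcal{E} \tag{7} \\
  &= \int_{D_j} (x - m_j)\,p(x)\,dx \tag{8} \\
  &= \int_{D_j} x\,p(x)\,dx - m_j \int_{D_j} p(x)\,dx. \tag{9}
\end{align}

Then

\begin{align}
m_j &= \frac{\int_{D_j} x\,p(x)\,dx}{\int_{D_j} p(x)\,dx} \tag{10} \\
    &= \bar{x}_j, \tag{11}
\end{align}

as claimed (when positive-definite Hessian conditions hold).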
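A finite-sample analogue of this result can be checked numerically:
the sample mean of a cluster gives a smaller mean-squared error than
any shifted point, and the excess error equals the squared length of
the shift. This is a minimal sketch, assuming one hypothetical
Gaussian cluster; the helper names `centroid` and `mean_squared_error`
are illustrative only.

```python
import random

def centroid(points):
    """Sample centroid: the componentwise mean of the points."""
    n = len(points)
    dim = len(points[0])
    return [sum(p[i] for p in points) / n for i in range(dim)]

def mean_squared_error(points, m):
    """Average squared Euclidean distance from the points to m."""
    return sum(sum((xi - mi) ** 2 for xi, mi in zip(p, m))
               for p in points) / len(points)

# Hypothetical cluster: mean (20, -20), standard deviation 11 (variance 121).
rng = random.Random(1)
points = [(rng.gauss(20, 11), rng.gauss(-20, 11)) for _ in range(2000)]

m_hat = centroid(points)
base = mean_squared_error(points, m_hat)
# Shift the candidate vector by (1.0, -0.5); squared shift length is 1.25.
shifted = mean_squared_error(points, [m_hat[0] + 1.0, m_hat[1] - 0.5])
```

Here `shifted - base` equals 1.25 up to rounding, because the cross
term vanishes at the sample centroid.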

Mean-squared-error optimal learning drives synaptic vectors to the
unknown centroids $\bar{x}_j$ of the locally sampled pattern classes.
More generally [8], $E[m_j] = \bar{x}_j$ holds asymptotically as the
random synaptic vector $m_j$ wanders in a Brownian motion about the
centroid $\bar{x}_j$. We observed this Brownian wandering in the
simulations discussed below (Fig. 6).

If there are exactly $p$ distinct pattern classes or clusters, the $p$
synaptic row vectors $m_1(t), \ldots, m_p(t)$ should each
asymptotically approach the centroid of a distinct pattern class. In
general, we do not know the number $k$ of pattern classes. If there
are fewer synaptic vectors than the number $k$ of pattern classes, if
$p < k$, the synaptic vectors should approach the centroids of the $p$
most massive, most probable pattern clusters.


If $p > k$, the synaptic vectors should approximate the entire density
function $p(x)$. More synaptic vectors should arrive at more probable
regions. Where patterns $x$ are dense or sparse, synaptic vectors
$m_j$ should be dense or sparse. The local count of synaptic vectors
then gives an accurate nonparametric estimate of the volume
probability $P(V)$ for volume $V \subset R^n$:

\begin{align}
P(V) &= \int_{V} p(x)\,dx \tag{12} \\
     &\approx \frac{\text{number of } m_j \in V}{p}. \tag{13}
\end{align}

In the extreme case that $V = R^n$, this approximation gives
$P(V) = p/p = 1$. For small or improbable subsets $V$,
$P(V) = 0/p = 0$.
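The counting estimate (13) amounts to one line of code. The sketch
below assumes four hypothetical trained synaptic vectors near the
cluster centroids used later in the text; the function name
`volume_probability` and the membership tests are illustrative
assumptions.

```python
def volume_probability(synaptic_vectors, in_volume):
    """Estimate P(V) as (number of synaptic vectors in V) / p, as in (13)."""
    inside = sum(1 for m in synaptic_vectors if in_volume(m))
    return inside / len(synaptic_vectors)

# Hypothetical trained vectors sitting near four cluster centroids.
vectors = [(20.1, 19.8), (19.7, -20.3), (-20.2, 20.0), (-19.9, -20.1)]

p_all = volume_probability(vectors, lambda m: True)         # V = R^n
p_right = volume_probability(vectors, lambda m: m[0] > 0)   # right half-plane
p_far = volume_probability(vectors, lambda m: m[0] > 100)   # improbable region
```

The two extreme cases reproduce the limits $p/p = 1$ and $0/p = 0$
noted above.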

Differential competitive learning (DCL) provides a new [7]
unsupervised form of AVQ. DCL modifies stochastic synaptic vectors
with a competing neuron's change in output signal. The neuronal signal
velocity locally reinforces the synaptic vector. The time derivative's
sign changes resemble the supervised sign changes in supervised
competitive learning (SCL) algorithms. SCL systems use more
information than DCL systems, since signal velocities do not depend on
class-membership information. In particular, the DCL algorithm in (40)
below does not use the class membership of the training sample $x$.

Both DCL-trained and SCL-trained synaptic vectors tend to rapidly
converge to pattern-class centroids [8]. Our simulated DCL synaptic
vectors converged faster to bipolar centroids, points in
$\{-1, 1\}^n$, than did SCL synaptic vectors when, as in biological
neural networks, a sigmoidal signal function nonlinearly transduced
neuronal activations to bounded signals. DCL systems exploit a
win-rate dependent sequence of learning coefficients. The faster the
neuron wins or loses, the more the synaptic vector resembles or
disresembles the sampled pattern. SCL systems ignore this
instantaneous rate information.

In practice, input neurons have linear signal functions:
$S_i(x_i) = x_i$. The user presents the random sample $x$ to the
system as the output of the $F_X$ neurons. In this case, our simulated
DCL and SCL synaptic vectors converged equally quickly. But the
DCL-trained synaptic vectors wandered less about class centroids than
did SCL-trained synaptic vectors.

II.  Stochastic  Competitive  Learning  Algorithms 

Autoassociative AVQ neural networks are two-layer feedforward networks
trained with competitive learning. The input neuronal field $F_X$
receives the sample data and passes it forward through synaptic
connection matrix $M$ to the $p$ competing neurons in field $F_Y$.
(Heteroassociative AVQ networks correspond to three-layer feedforward
networks.) Synchronous feedforward flow obviates the neural
interpretation. AVQ neural systems are simply signal processing
algorithms.

The metaphor of competing neurons reduces to a nearest-neighbor
classification. The system compares the current vector random sample
$x(t)$ in Euclidean distance to the $p$ columns of the synaptic
connection matrix $M$, to the $p$ synaptic vectors
$m_1(t), \ldots, m_p(t)$. If the $j$th synaptic vector $m_j(t)$ is
closest to $x(t)$, then the $j$th neuron "wins" the competition for
activation at time $t$.

Many within-field feedback dynamical systems approximate this
nearest-neighbor, winner-take-all behavior. Mathematically, the $j$th
competing neuron should behave as a class indicator function:
$S_j = I_{D_j}$. More generally, the $j$th $F_Y$ neuron only estimates
$I_{D_j}$. Then misclassification can still occur:
$S_j(x m_j + f_j) = 1$ but $I_{D_j}(x) = 0$ for row vectors $x$ and
$m_j$, where $f_j$ denotes the inhibitive within-field feedback the
$j$th neuron receives.

We modify the nearest or "winning" synaptic vector $m_j$ with a simple
difference learning law. We add some scaled form of $x(t) - m_j(t)$ to
$m_j(t)$ to form $m_j(t + 1)$. We can also update near-neighbors of
the winning neuron. In practice, and in the simulations below, we
modify only one synaptic vector at a time. We do not modify "losers":
$m_i(t + 1) = m_i(t)$.

The stochastic unsupervised competitive learning (UCL) algorithm
represents the simplest competitive learning algorithm. Pattern
recognition theorists first studied the UCL algorithm but called it
adaptive K-means clustering [10]. Kohonen extended the UCL algorithm
to two supervised versions, SCL1 [3], [4] and SCL2 [5]. The supervisor
must know the class membership of each sample pattern $x$. The SCL1
and SCL2 algorithms linearly "reward" correct classifications as in
the UCL algorithm. They "punish" incorrect classifications with a sign
change. We obtain all three algorithms from the following three-step
algorithm if we replace the third step with the appropriate stochastic
difference equation.

A.  Competitive  AVQ  Algorithms 

1) Initialize synaptic vectors: $m_i(0) = x(i)$, $i = 1, \ldots, p$.
Sample-dependent initialization avoids many pathologies that can
distort nearest-neighbor learning.

2) For random sample $x(t)$, find the closest or "winning" synaptic
vector $m_j(t)$:

\[ \| m_j(t) - x(t) \| = \min_{i} \| m_i(t) - x(t) \|, \tag{14} \]

where $\|x\|^2 = x_1^2 + \cdots + x_n^2$ defines the squared Euclidean
vector norm of $x$.

3) Update the winning synaptic vector $m_j(t)$ with the UCL, SCL1, or
SCL2 learning algorithm.

B. Unsupervised Competitive Learning (UCL)

\begin{align}
m_j(t + 1) &= m_j(t) + c_t\,[x(t) - m_j(t)], \tag{15} \\
m_i(t + 1) &= m_i(t) \quad \text{if } i \neq j, \tag{16}
\end{align}

where $\{c_t\}$ denotes a slowly decreasing sequence of learning
coefficients. In our simulations, $c_t = 0.1\,[1 - (t/2000)]$ for 2000
training samples. The UCL algorithm (15) restates the classical
adaptive K-means clustering algorithm.
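The three steps with the UCL update can be sketched as follows. This
is a minimal illustration, not the simulation code used for the
results reported here. It assumes a linearly decreasing coefficient
$c_t = 0.1\,(1 - t/T)$ over $T$ samples and two hypothetical Gaussian
clusters, and the names `winner` and `ucl_train` are invented for the
example. The sample list alternates between clusters so that
sample-dependent initialization places one synaptic vector in each.

```python
import random

def winner(m, x):
    """Step 2: index of the synaptic vector nearest to x (squared Euclidean)."""
    return min(range(len(m)),
               key=lambda j: sum((mj - xi) ** 2 for mj, xi in zip(m[j], x)))

def ucl_train(samples, p):
    """Train p synaptic vectors with the winner-only UCL update."""
    m = [list(samples[i]) for i in range(p)]      # Step 1: m_i(0) = x(i)
    total = len(samples)
    for t, x in enumerate(samples):
        c_t = 0.1 * (1.0 - t / total)             # assumed decreasing coefficient
        j = winner(m, x)
        # Step 3: only the winner moves toward the sample; losers stay put.
        m[j] = [mj + c_t * (xi - mj) for mj, xi in zip(m[j], x)]
    return m

# Hypothetical data: two Gaussian clusters, samples alternating between them.
rng = random.Random(0)
centers = [(-20.0, 0.0), (20.0, 0.0)]
samples = []
for t in range(2000):
    cx, cy = centers[t % 2]
    samples.append((rng.gauss(cx, 3.0), rng.gauss(cy, 3.0)))

trained = sorted(ucl_train(samples, p=2))         # sort by first coordinate
```

After training, each entry of `trained` should lie close to one of the
two assumed cluster centers.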


Stochastic approximation [11] requires a decreasing gain sequence
$\{c_t\}$ to suppress random disturbances and to guarantee convergence
to a local minimum of mean-squared performance measures. The learning
coefficients should decrease slowly,

\[ \sum_{t=1}^{\infty} c_t = \infty, \tag{17} \]

but not too slowly:

\[ \sum_{t=1}^{\infty} c_t^2 < \infty. \tag{18} \]

Harmonic-series coefficients, $c_t = 1/t$, satisfy these constraints.
For fast robust [2] stochastic approximation, only the harmonic-series
coefficients satisfy these constraints.
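As a quick check of constraints (17) and (18) for the harmonic-series
choice, using standard results on $p$-series rather than anything
derived in the text:

```latex
\sum_{t=1}^{\infty} \frac{1}{t} = \infty
\qquad \text{and} \qquad
\sum_{t=1}^{\infty} \frac{1}{t^{2}} = \frac{\pi^{2}}{6} < \infty .
```

By contrast, a constant sequence $c_t = c > 0$ satisfies (17) but
violates (18), and $c_t = 1/t^2$ satisfies (18) but violates (17).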

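C. Supervised Competitive Learning 1 (SCL1)

\[ m_j(t + 1) = \begin{cases} m_j(t) + c_t\,[x(t) - m_j(t)] & \text{if } x(t) \in D_j \\ m_j(t) - c_t\,[x(t) - m_j(t)] & \text{if } x(t) \notin D_j, \end{cases} \tag{19} \]

\[ m_i(t + 1) = m_i(t) \quad \text{if } i \neq j. \tag{20} \]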

SCL1 supervises or reinforces synaptic modification. $m_j$ learns
positively if the system correctly classifies the random sample $x$.
$m_j$ learns negatively, or forgets selectively, if the system
misclassifies the random sample. Then $m_j$ tends to move out of
regions of misclassification in $R^n$. Tsypkin [12] first derived the
SCL1 algorithm as a special case of his adaptive Bayes classifier. We
can rewrite the SCL1 update equation (19) as

\[ m_j(t + 1) = m_j(t) + c_t\, r_j(x(t))\,[x(t) - m_j(t)] \tag{21} \]

if we define the supervised reinforcement function $r_j$ as

\[ r_j = I_{D_j} - \sum_{i \neq j} I_{D_i}. \tag{22} \]

$r_j$ depends explicitly on class indicator functions. $r_j$ rewards
correct pattern classifications with $+1$ and punishes
misclassifications with $-1$. We implicitly assume the $j$th neuron
accurately estimates the $j$th indicator function:
$S_j(x m_j + f_j) = I_{D_j}(x)$.
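A single SCL1 step with the reinforcement function can be sketched as
below. The sketch assumes each synaptic vector carries a class label
(`neuron_classes`, a hypothetical bookkeeping list) and each training
sample arrives with its own label. With a partition of the pattern
space, the reinforcement value is $+1$ when the two labels agree and
$-1$ otherwise.

```python
def reinforcement(neuron_class, sample_class):
    """r_j(x): +1 for a correct classification, -1 for a misclassification."""
    return 1 if neuron_class == sample_class else -1

def scl1_step(m, neuron_classes, x, x_class, c_t):
    """Apply one SCL1 update in place; return (winner index, reinforcement)."""
    j = min(range(len(m)),
            key=lambda i: sum((a - b) ** 2 for a, b in zip(m[i], x)))
    r = reinforcement(neuron_classes[j], x_class)
    # Reward moves the winner toward x; punishment moves it away from x.
    m[j] = [mj + c_t * r * (xi - mj) for mj, xi in zip(m[j], x)]
    return j, r

# Hypothetical two-neuron example.
m = [[0.0, 0.0], [10.0, 0.0]]
neuron_classes = [0, 1]
first = scl1_step(m, neuron_classes, (2.0, 0.0), x_class=0, c_t=0.5)
after_reward = list(m[0])
second = scl1_step(m, neuron_classes, (2.0, 0.0), x_class=1, c_t=0.5)
after_punish = list(m[0])
```

In this example the first sample rewards neuron 0 and pulls it halfway
toward the sample; the second sample carries the other class label, so
the same neuron wins again but moves away from the sample.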

The SCL2 algorithm slightly modifies the SCL1 algorithm. The SCL2
algorithm better estimates the optimal Bayes decision-theoretic
boundary in some cases. The Bayes decision boundary minimizes the
misclassification error. It represents the crossing point of the
unknown conditional densities $p(x \mid D_i)$ and $p(x \mid D_j)$.

The nearest-neighbor decision boundary corresponds to the hyperplane
that bisects the line that connects the two class centroids. If the
pattern distribution is asymmetric, if, for instance, local density
functions with different variances generate different decision
classes, then the SCL1 decision boundary may not resemble the Bayes
decision boundary. Nearest-neighbor classification tends to perform
better in the equal-variance case than in the unequal-variance case.

D. Supervised Competitive Learning 2 (SCL2)

\begin{align}
m_j(t + 1) &= m_j(t) - c_t\,[x(t) - m_j(t)], \tag{23} \\
m_i(t + 1) &= m_i(t) + c_t\,[x(t) - m_i(t)], \tag{24}
\end{align}

if $x \in D_i$ instead of $x \in D_j$, and if $m_j(t)$ is the nearest
synaptic vector and $m_i(t)$ is the next-to-nearest synaptic vector,

\[ \| m_j(t) - x(t) \| < \| m_i(t) - x(t) \| = \min_{k \neq j} \| m_k(t) - x(t) \|, \tag{25} \]

and if $x(t)$ falls in a class-dependent "window." In all other cases,

\[ m_i(t + 1) = m_i(t). \tag{26} \]

The window defines a hyperrectangle in $R^n$ centered at the midpoint
of the hyperline that connects the centroids of $D_j$ and $D_i$. If
$x(t)$ does not fall in the hyperwindow, we modify no synaptic vector.
We defined the $R^n$ window between $D_j$ and $D_i$ as the
$n$-dimensional hyperrectangle
$[m_1 - d, m_1 + d] \times \cdots \times [m_n - d, m_n + d]$, where
$m_{ji}$ denotes the midpoint
$m_{ji} = (m_1, \ldots, m_n) = (m_j + m_i)/2$, and $d$ denotes the
window half-width. We put $d = 2.5$.
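The SCL2 conditions combine a class test, a nearest and
next-to-nearest test, and a window test. The sketch below is one
reading of those conditions under stated assumptions: the midpoint is
taken between the two synaptic vectors themselves, as in the formula
$(m_j + m_i)/2$, and `neuron_classes` is a hypothetical list of class
labels for the synaptic vectors.

```python
def scl2_step(m, neuron_classes, x, x_class, c_t, d=2.5):
    """Apply one SCL2 update in place; return True if any vector changed."""
    order = sorted(range(len(m)),
                   key=lambda k: sum((a - b) ** 2 for a, b in zip(m[k], x)))
    j, i = order[0], order[1]                 # nearest and next-to-nearest
    mid = [(a + b) / 2.0 for a, b in zip(m[j], m[i])]
    in_window = all(abs(xk - ck) <= d for xk, ck in zip(x, mid))
    wrong_nearest = neuron_classes[j] != x_class
    right_next = neuron_classes[i] == x_class
    if wrong_nearest and right_next and in_window:
        m[j] = [a - c_t * (xk - a) for a, xk in zip(m[j], x)]   # punish nearest
        m[i] = [b + c_t * (xk - b) for b, xk in zip(m[i], x)]   # reward next
        return True
    return False                              # all other cases: no change

# Hypothetical examples.
m_a = [[0.0, 0.0], [4.0, 0.0]]
changed_a = scl2_step(m_a, [0, 1], (1.5, 0.0), x_class=1, c_t=0.5)

m_b = [[0.0, 0.0], [4.0, 0.0]]
changed_b = scl2_step(m_b, [0, 1], (1.5, 0.0), x_class=0, c_t=0.5)

m_c = [[0.0, 0.0], [20.0, 0.0]]
changed_c = scl2_step(m_c, [0, 1], (3.0, 0.0), x_class=1, c_t=0.5)
```

Only the first example meets all three conditions. The second fails
the class test (the nearest vector already has the right class), and
the third fails the window test.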

III. Differential Competitive Learning

The differential competitive learning (DCL) law [7] combines
competitive and differential Hebbian learning:

\[ \dot{m}_{ij} = \dot{S}_j(y_j)\,[S_i(x_i) - m_{ij}] \tag{27} \]

or, in vector notation,

\[ \dot{m}_j = \dot{S}_j(y_j)\,[S(x) - m_j], \tag{28} \]

where $S(x) = (S_1(x_1), \ldots, S_n(x_n))$ and
$m_j = (m_{1j}, \ldots, m_{nj})$. $m_{ij}$ denotes the synaptic weight
between the $i$th neuron in input neuronal field $F_X$ and the $j$th
neuron in competitive field $F_Y$. Nonnegative signal functions $S_i$
and $S_j$ transduce the real-valued activations $x_i$ and $y_j$ into
the bounded monotone nondecreasing signals $S_i(x_i)$ and $S_j(y_j)$.
$\dot{m}_{ij}$ and $\dot{S}_j(y_j)$ denote the time derivatives of
$m_{ij}$ and $S_j(y_j)$, synaptic and signal velocities.

The stochastic calculus version of the DCL law relates random
processes:

\[ dm_{ij} = dS_j\,[S_i - m_{ij}] + dB_{ij}. \tag{29} \]

$B_{ij}$ denotes a Brownian-motion diffusion process centered at the
origin. We can rewrite (29) in "noise" notation as

\[ \dot{m}_{ij} = \dot{S}_j\,[S_i - m_{ij}] + n_{ij}. \tag{30} \]

The "noise" process $n_{ij}$ has zero mean, $E[n_{ij}] = 0$, and has
finite variance, $V[n_{ij}] = \sigma_{ij}^2 < \infty$. The
random-sampling AVQ framework implicitly assumes that all competitive
learning laws are stochastic differential or difference equations.
Such stochastic synaptic vectors $m_j$ tend to converge to
pattern-class centroids, and converge exponentially quickly [8].


$S_j(y_j)$ measures the competitive status of the $j$th competing
neuron in $F_Y$. Usually, $S_j$ approximates a binary threshold
function. $S_j$ may equal a steep binary logistic sigmoid,

\[ S_j(y_j) = \frac{1}{1 + e^{-c y_j}}, \tag{31} \]

for some constant $c > 0$. The $j$th neuron wins the laterally
inhibitive competition if $S_j = 1$, loses if $S_j = 0$.

In (27), $m_j$ learns only if $S_j(y_j)$ changes. This contrasts with
the classical competitive learning law

\[ \dot{m}_{ij} = S_j(y_j)\,[S_i(x_i) - m_{ij}], \tag{32} \]

which modulates the difference $S(x) - m_j$ with the win-loss signal
$S_j$, not its velocity $\dot{S}_j$. In (32), $m_j$ learns only if the
competitive signal $S_j$ exceeds zero, only if the $j$th neuron "wins"
the activation competition.

Real neurons transmit and receive pulse trains. Pulse-coded signal
functions $S_j$ reveal the connection between competitive and
differential competitive learning. A pulse-coded signal function uses
an exponentially fading window [1] of sampled binary pulses:

\[ S_j(t) = \int_{-\infty}^{t} y_j(s)\,e^{s - t}\,ds, \tag{33} \]

where $y_j(t) = 1$ if a pulse occurs at $t$, and $y_j(t) = 0$ if no
pulse occurs at $t$. Then [9]

\[ \dot{S}_j(t) = y_j(t) - S_j(t). \tag{34} \]

So the DCL law (27) reduces to

\[ \dot{m}_{ij} = y_j\,[S_i - m_{ij}] - S_j\,[S_i - m_{ij}]. \tag{35} \]

When the second term in (35) is sufficiently small, DCL reduces to
competitive learning. This occurs when a losing neuron suddenly wins,
for then $y_j = 1$ and $S_j \approx 0$. In the stochastic case, the
random pulse function $y_j$ represents an arbitrary random point
process, and converts (35) to a doubly stochastic model.
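Relation (34) follows from differentiating an exponentially fading
window. The sketch below assumes the window takes the integral form
shown in its first line, which is one form consistent with (34), and
applies the Leibniz rule:

```latex
\begin{align*}
S_j(t) &= e^{-t} \int_{-\infty}^{t} y_j(s)\,e^{s}\,ds, \\
\dot{S}_j(t) &= -e^{-t} \int_{-\infty}^{t} y_j(s)\,e^{s}\,ds
               + e^{-t}\,y_j(t)\,e^{t} \\
             &= y_j(t) - S_j(t).
\end{align*}
```

Substituting this velocity into the DCL law (27) gives the two-term
form (35).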

Similarly, the classical differential Hebbian law [6]

\[ \dot{m}_{ij} = -m_{ij} + \dot{S}_i \dot{S}_j \tag{36} \]

reduces to signal Hebbian learning on average (in the absence of
pulses):

\begin{align}
\dot{m}_{ij} &= -m_{ij} + \dot{S}_i \dot{S}_j \tag{37} \\
             &= -m_{ij} + S_i S_j + [x_i y_j - x_i S_j - y_j S_i] \tag{38} \\
             &\approx -m_{ij} + S_i S_j \tag{39}
\end{align}

on average. The approximation holds exactly if and only if no $x_i$ or
$y_j$ pulses are present, a frequent event. Differential Hebbian
learning synapses "fill in" with Hebbian learning when pulses are
absent.

For discrete implementation, we use the DCL algorithm as a stochastic
difference equation.

A. Differential Competitive Learning (DCL)

1) Initialize: $m_i(0) = x(i)$.

2) Find winning $m_j(t)$:
$\| m_j(t) - x(t) \| = \min_i \| m_i(t) - x(t) \|$.

3) Update winning $m_j(t)$:

\[ m_j(t + 1) = m_j(t) + c_t\,\Delta S_j(y_j(t))\,[S(x(t)) - m_j(t)] \quad \text{if the } j\text{th neuron wins,} \tag{40} \]

\[ m_i(t + 1) = m_i(t) \quad \text{if the } i\text{th neuron loses.} \tag{41} \]

$\Delta S_j(y_j(t))$ denotes the time change of the $j$th neuron's
competition signal $S_j(y_j)$ in the competition layer $F_Y$:

\[ \Delta S_j(y_j(t)) = \operatorname{sgn}[S_j(y_j(t + 1)) - S_j(y_j(t))]. \tag{42} \]

We define the signum operator $\operatorname{sgn}(x)$ as

\[ \operatorname{sgn}(x) = \begin{cases} 1 & \text{if } x > 0 \\ 0 & \text{if } x = 0 \\ -1 & \text{if } x < 0. \end{cases} \tag{43} \]

We update the $F_Y$ neuronal activations $y_j$ with the additive model

\[ y_j(t + 1) = y_j(t) + \sum_{i=1}^{n} S_i(x_i(t))\,m_{ij}(t) + \sum_{k=1}^{p} S_k(y_k(t))\,w_{kj}. \tag{44} \]

In our simulations, the first sum in (44) reduced to

\[ \sum_{i=1}^{n} x_i(t)\,m_{ij}(t) \tag{45} \]

when we did not transform the input patterns $x$ with a nonlinear
signal function $S_i$. Input or $F_X$ neurons in feedforward networks
usually behave linearly: $S_i(x_i) = x_i$.

For linear inputs, we computed the second sum in (44) for linear
signal functions $S_k$. Since we allowed only one winner per
iteration, this sum reduced to a single term $y_k w_{kj}$, where $k$
denotes the winning neuron.

The $p \times p$ matrix $W$ defined the $F_Y$ within-field synaptic
connection strengths:

\[ W = \begin{pmatrix} 2 & -1 & \cdots & -1 \\ -1 & 2 & \cdots & -1 \\ \vdots & \vdots & \ddots & \vdots \\ -1 & -1 & \cdots & 2 \end{pmatrix}. \tag{46} \]

Diagonal elements $w_{ii}$ equaled 2, off-diagonal elements equaled
$-1$. Fig. 1 shows the connection topology of the laterally inhibitive
DCL network.
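The discrete DCL steps can be sketched for linear inputs as follows.
This is a simplified illustration under several assumptions and not
the simulation code behind the reported results. It uses the sign of
the winner's activation change in place of the signal change, it
treats the winner's within-field signal as 1 so that the feedback term
reduces to the diagonal weight 2, it assumes
$c_t = 0.1\,(1 - t/T)$, and it uses hypothetical Gaussian data whose
first samples come from different clusters.

```python
import random

def sgn(v):
    """Signum operator: 1, 0, or -1."""
    return (v > 0) - (v < 0)

def dcl_train(samples, p):
    """Train p synaptic vectors with a simplified discrete DCL update."""
    m = [list(samples[i]) for i in range(p)]      # m_i(0) = x(i)
    w_self = 2.0                                  # diagonal entry of W
    total = len(samples)
    for t, x in enumerate(samples):
        c_t = 0.1 * (1.0 - t / total)             # assumed decreasing coefficient
        j = min(range(p),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(m[i], x)))
        # Winner's activation change: feedforward inner product plus
        # self-excitation (assumes a unit winner signal, losers at zero).
        delta_y = sum(xi * mij for xi, mij in zip(x, m[j])) + w_self
        ds = sgn(delta_y)                         # sign used in place of Delta S_j
        m[j] = [mij + c_t * ds * (xi - mij) for xi, mij in zip(x, m[j])]
    return m

# Hypothetical data: two Gaussian clusters, samples alternating between them.
rng = random.Random(0)
centers = [(-20.0, 0.0), (20.0, 0.0)]
samples = []
for t in range(2000):
    cx, cy = centers[t % 2]
    samples.append((rng.gauss(cx, 3.0), rng.gauss(cy, 3.0)))

trained = sorted(dcl_train(samples, p=2))         # sort by first coordinate
```

With these well-separated clusters the winner's inner product with the
sample stays positive, so the sign term equals $+1$ and the update
behaves like a reward step. A negative activation change would instead
push the winner away from the sample.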

Each neuron in $F_Y$ codes for a specific pattern class. By (44) and
the "cosine law",

\[ S(x)\,m_j^T = \|S(x)\|\,\|m_j\| \cos(S(x), m_j), \tag{47} \]

positive learning ($\Delta m_{ij} > 0$) tends to occur when the system
classifies $x$ to the nearest pattern class $D_j$.

If we represent the $F_X$ signal function $S_i$ with the bipolar
logistic function,

\[ S_i(x_i) = \frac{2}{1 + e^{-c x_i}} - 1, \tag{48} \]

$c > 0$, then the DCL algorithm (40) abstracts the corresponding
bipolar pattern from the real-valued input. The unsupervised sign
change $\Delta S_j$ in the DCL law (40) resembles the reward-punish
sign change in the SCL1 and SCL2 algorithms. This suggests that we can
meaningfully compare the algorithms' performance on the same training
and test data.

If we choose $S_i(x_i)$ as a linear function of the input, if
$S_i(x_i) = x_i$, then the discrete version of DCL resembles the UCL,
SCL1, and SCL2 algorithms. We used both linear and nonlinear
formulations to compare DCL to SCL1 and SCL2. The supervised SCL1 and
SCL2 algorithms always outperformed the UCL algorithm. So we limited
our DCL comparisons to SCL1 and SCL2 systems.

For most simulations, we used linearly transformed data,
$S_i(x_i) = x_i$. In these cases, we approximated the signal
difference $\Delta S_j$ as the activation difference $\Delta y_j$:

\begin{align}
\Delta S_j(y_j(t)) &\approx \Delta y_j(t) \tag{49} \\
                   &= \operatorname{sgn}[y_j(t + 1) - y_j(t)]. \tag{50}
\end{align}

This approximation holds exactly over the linear part of a signal
function's range. For then $S_j' = dS_j/dy_j = c$ for some constant
$c > 0$. Then,

\begin{align}
\dot{S}_j &= S_j'\,\dot{y}_j \tag{51} \\
          &= c\,\dot{y}_j. \tag{52}
\end{align}

The constant $c$ does not affect the signum operator used in
$\Delta y_j$.

Linear data often produce large activation sums $\sum_i x_i m_{ij}$
that saturate nonlinear signals $S_j$ to extreme values. Then the
signal difference $\Delta S_j$ equals zero and may not discriminate
changes in the competitive status. The activation difference
$\Delta y_j$ remains sensitive to these changes.

IV. Comparison of Competitive and Differential Competitive Learning
for Centroid Estimation

We compared the DCL algorithm to the SCL1 and SCL2 algorithms for
estimating centroids. All algorithms adaptively moved the synaptic
vectors $m_j$ to pattern-class centroids. They differed in how quickly
the trained synaptic vectors reached the centroids and how much the
synaptic vectors wandered about the centroids. The DCL algorithm moved
the synaptic vectors to centroids at least as fast as did the SCL1 and
SCL2 algorithms. Once the synaptic vectors reached the pattern-class
centroids, the DCL-trained synaptic vectors wandered less about the
centroids than the SCL-trained synaptic vectors.

When bipolar signal functions transformed the inputs, the DCL
algorithm converged to centroids faster than the SCL1 and SCL2
algorithms. Convergence rates were the same for linear signal
functions, $S_i(x_i) = x_i$. The pattern space consisted of 2000
two-dimensional Gaussian-distributed pattern vectors with variance 121
and with centroids or modes at (20, 20), (20, -20), (-20, 20), and
(-20, -20). Fig. 2 shows centroid convergence of DCL synaptic vectors
with inputs transformed with bipolar signal functions. Fig. 3 shows
the slower convergence of the SCL1 algorithm with the same transformed
Gaussian data. * denotes DCL synaptic vectors. + denotes SCL1 synaptic
vectors. Figs. 4 and 5 show centroid convergence for the same Gaussian
data when the systems used linear signal functions.

DCL-trained synaptic vectors wandered with less mean-squared error
about centroids than did SCL-trained synaptic vectors. Fig. 6 shows
mean-squared wandering about the Gaussian pattern-class centroid
(-20, 20). Fig. 6 represents several such experiments with different
Gaussian and non-Gaussian pattern distributions. Solid lines denote
the convergence of the DCL synaptic vector. Dashed lines denote
convergence of the SCL1 synaptic vector.


Fig. 2. Convergence of DCL-trained synaptic vectors to bipolar
centroids of four Gaussian clusters. Bipolar logistic signal functions
$S_i(x_i)$ nonlinearly transduce the real-valued input vector $x$ into
a bipolar vector in $\{-1, 1\}^n$.

Fig. 3. Centroid convergence of SCL1 synaptic vectors trained with the
same patterns as in Fig. 2. Bipolar logistic signal functions
nonlinearly transduce real input patterns to bipolar patterns.

Fig. 4. Convergence of DCL-trained synaptic vectors to Gaussian
pattern-class centroids. Same pattern distribution as in Figs. 2 and
3. Input data not transformed: $S_i(x_i) = x_i$.

Fig. 5. Convergence of SCL1-trained synaptic vectors to Gaussian
pattern-class centroids for the same pattern distribution as in
Fig. 4.

Fig. 6. Trajectories of the synaptic vectors after reaching the
Gaussian pattern-class centroid at (-20, 20). Solid lines represent
DCL-trained synaptic vectors. Dashed lines represent SCL1-trained
synaptic vectors. The two graphs plot separately the $m_1$ and $m_2$
components of the synaptic vector $m = (m_1, m_2)$.


We calculated the mean-squared error (MSE) of centroid wandering for
the class centered at (-20, 20) after 200 iterations. Other centroids
produced comparable MSE of centroid wandering. In the first case, we
used 540 Gaussian samples with variance 25. Then, for the DCL
algorithm, the MSE of centroid wandering equaled 0.48. For the SCL1
algorithm, it equaled 1.48. In the second case, we used 554 Gaussian
samples with variance 121. Then, for the DCL algorithm, the MSE of
centroid wandering equaled 4. For the SCL1 algorithm, it equaled 7.11.

Next we compared the DCL system to the SCL1 and SCL2 systems for
pattern classification accuracy. We trained each AVQ system with 500
Gaussian-distributed samples for each pattern class, and for each
variance level centered about the same centroids (-20, 0) and (20, 0).
We set variance levels at 20 units. For each variance level, we tested
each AVQ system with 1000 new Gaussian-distributed samples for each
pattern class. Fig. 7(a) shows the misclassification rates of the DCL,
SCL1, and SCL2 systems for two representative Gaussian classes with
equal variances. Fig. 7(b) shows misclassification performance for
each AVQ system when we repeated the simulation in Fig. 7(a) for
unequal variances. The pattern class with centroid (20, 0) had twice
the variance of the pattern class with centroid (-20, 0). The three
clustering algorithms behaved similarly for increasing variance
values.


Fig. 7. AVQ misclassification rates for two Gaussian clusters: (a)
with equal variance centered about the centroids (-20, 0) and (20, 0);
and (b) with unequal variance. In (b), the pattern class centered
about (20, 0) has twice the variance of the pattern class centered
about (-20, 0).

V. Phoneme Recognition Simulations

We obtained speech training samples from samples of continuous male
speech with different English pronunciations. We used a time-dependent
Fourier spectrum to extract features from the speech waveforms. An
anti-alias low-pass filter prefiltered the speech signals. We then
digitized the signals with a 10 kHz sampling frequency. A Hamming
window divided the digitized speech signal into 256-sample segments.
The fast Fourier transform algorithm gave 256 complex Fourier
coefficients for each of the 256 windowed sample segments. We divided
the 200 Hz-5 kHz frequency range into 16 regions. We divided the
200 Hz-3 kHz frequency range into 12 equal regions and the 3-5 kHz
frequency range into four equal regions. Six Fourier coefficients
represented each region in the 200 Hz-3 kHz range. Thirteen Fourier
coefficients represented each region in the 3-5 kHz range. We
calculated average power spectra over each region to form a
16-dimensional pattern vector. We produced 16-dimensional phoneme
pattern vectors by repeatedly sliding the Hamming window by 100
samples.
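The feature extraction described above can be sketched in a few lines.
This is an illustrative reconstruction, not the original processing
code. It assumes the first region starts at DFT bin 5 (about 195 Hz at
a 10 kHz sampling rate with 256 points), which makes 12 regions of six
bins and four regions of 13 bins end exactly at the 5 kHz Nyquist bin.
It also uses a direct DFT in place of a fast Fourier transform to stay
dependency-free, and all function and constant names are hypothetical.

```python
import cmath
import math

FS = 10_000      # sampling frequency in Hz (from the text)
N = 256          # samples per Hamming-windowed segment (from the text)
HOP = 100        # window shift in samples (from the text)
FIRST_BIN = 5    # assumption: bin nearest 200 Hz (5 * FS / N is about 195 Hz)
# 12 regions of 6 coefficients (200 Hz-3 kHz), then 4 regions of 13 (3-5 kHz).
REGION_SIZES = [6] * 12 + [13] * 4

def hamming(n):
    """Hamming window of length n."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * k / (n - 1)) for k in range(n)]

def dft_bin(frame, k):
    """k-th DFT coefficient of the frame (direct sum, no FFT)."""
    n = len(frame)
    return sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))

def pattern_vector(segment):
    """16 average-power features for one 256-sample segment."""
    frame = [s * w for s, w in zip(segment, hamming(N))]
    features, k = [], FIRST_BIN
    for size in REGION_SIZES:
        power = [abs(dft_bin(frame, b)) ** 2 for b in range(k, k + size)]
        features.append(sum(power) / size)
        k += size
    return features

def pattern_vectors(signal):
    """Slide the window by HOP samples and return one vector per position."""
    return [pattern_vector(signal[s:s + N])
            for s in range(0, len(signal) - N + 1, HOP)]

# Hypothetical test tone centered on DFT bin 38 (about 1484 Hz).
tone = [math.sin(2 * math.pi * (38 * FS / N) * t / FS) for t in range(456)]
vecs = pattern_vectors(tone)
```

For this tone, the largest feature falls in the sixth region (index
5), which covers bins 35 to 40 under the assumed bin layout.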

The sample space consisted of real and artificial phonemes. The
artificial phonemes were Gaussian random vectors with variance 9
centered at the real phoneme vectors. We generated these noisy phoneme
samples to provide the AVQ systems with a statistically representative
set of training samples.

The simulation compared the DCL, SCL1, and SCL2 learning systems for
classification of nine representative English phonemes: five vowels
/a, e, i, o, u/, two fricatives /f, s/, one nasal /n/, and one plosive
sound /t/. Table I lists the misclassification rates. The AVQ systems
tended to classify vowel and nasal sounds more accurately than they
classified fricative and plosive sounds.

We trained each competitive AVQ system with 1000 Gaussian-distributed
random phoneme vectors clustered into nine pattern classes. Each
pattern class was centered about the original spoken phoneme and
radially distributed with variance $\sigma^2 = 9$. We randomly
selected training data according to a uniform probability distribution
to simulate nine equiprobable pattern classes. We tested each AVQ
system with 100 new Gaussian-distributed phoneme samples for each
phoneme type. Except for the two phonemes /o/ and /t/, the DCL system
misclassified no more frequently than the SCL systems.


TABLE I
Percentage Misclassification Rates of the DCL, SCL1, and SCL2
Systems for the Nine English Phonemes /a, e, i, o, u, f, s, n, t/

Phoneme    DCL    SCL1    SCL2
/a/          0       0       0
/e/          3       9       4
/i/          0       0       4
/o/          5       3      16
/u/          0       1       2
/f/         28      43      53
/s/          1       2       6
/n/          3       4       7
/t/         52      48      26


Conclusions 

The DCL system performed well in centroid estimation and phoneme
recognition. DCL synaptic vectors converged faster to centroids than
did SCL1 synaptic vectors when logistic bipolar signal functions
transformed the input sample. DCL synaptic vectors wandered less about
pattern-class centroids than SCL synaptic vectors.

Our phoneme-recognition simulations were preliminary, but agreed with
our centroid-estimation simulations. The phoneme-recognition
simulations suggest that unsupervised DCL systems will perform as well
as supervised SCL1 and SCL2 systems in many pattern environments, even
though DCL systems use less pattern-class information.

In general, we do not know in advance whether $x \in D_i$ for every
training sample $x$, and for every pattern class $D_i$, for an
arbitrary classification, filtering, or estimation problem. We may not
even know the approximate number or characteristics of the underlying
decision classes. We can still apply DCL techniques in these cases and
expect SCL-level performance. But we may never know how SCL systems
would perform on the same data.


CHAPTER 17


FUZZY  ASSOCIATIVE  MEMORIES 


Fuzzy  Systems  as  Between-Cube  Mappings 

In Chapter 16, we introduced continuous or fuzzy sets as points in the
unit hypercube $I^n = [0, 1]^n$. Within the cube we were interested in
the distance between points. This led to measures of the size and
fuzziness of a fuzzy set and, more fundamentally, to a measure of how
much one fuzzy set is a subset of another fuzzy set. This within-cube
theory directly extends to the continuous case where the space $X$ is
a subset of $R^n$ or, in general, where $X$ is a subset of products of
real or complex spaces.

The next step is to consider mappings between fuzzy cubes. This level
of abstraction provides a surprising and fruitful alternative to the
propositional and predicate-calculus reasoning techniques used in
artificial-intelligence (AI) expert systems. It allows us to reason
with sets instead of propositions.

The fuzzy set framework is numerical and multidimensional. The AI
framework is symbolic and one-dimensional, with usually only bivalent
expert "rules" or propositions allowed. Both frameworks can encode
structured knowledge in linguistic form. But the fuzzy approach
translates the structured knowledge into a flexible numerical
framework and processes it in a manner that resembles neural network
processing. The numerical framework also allows fuzzy systems to be
adaptively inferred and modified, perhaps with neural or statistical
techniques, directly from problem domain sample data.

Between-cube theory is fuzzy systems theory. A fuzzy set is a point in
a cube. A fuzzy system is a mapping between cubes. A fuzzy system $S$
maps fuzzy sets to fuzzy sets. Thus a fuzzy system $S$ is a
transformation $S: I^n \to I^p$. The $n$-dimensional unit hypercube
$I^n$ houses all the fuzzy subsets of the domain space, or input
universe of discourse, $X = \{x_1, \ldots, x_n\}$. $I^p$ houses all
the fuzzy subsets of the range space, or output universe of discourse,
$Y = \{y_1, \ldots, y_p\}$. $X$ and $Y$ can also be subsets of $R^n$
and $R^p$. Then the fuzzy power sets $F(2^X)$ and $F(2^Y)$ replace
$I^n$ and $I^p$.

In general a fuzzy system S maps families of fuzzy sets to families of fuzzy sets, thus
$S : I^{n_1} \times \cdots \times I^{n_r} \to I^{p_1} \times \cdots \times I^{p_s}$. Here too we can extend the definition of a
fuzzy  system  to  allow  arbitrary  products  of  arbitrary  mathematical  spaces  to  serve  as  the 
domain  or  range  spaces  of  the  fuzzy  sets. 

(A  technical  comment  is  in  order  for  sake  of  historical  clarification.  A  tenet,  perhaps 
the  defining  tenet,  of  the  classical  theory  [Dubois,  1980]  of  fuzzy  sets  as  functions  concerns 
the fuzzy extension of any mathematical function. This tenet holds that any function
$f : X \to Y$ that maps points in X to points in Y can be extended to map the fuzzy
subsets of X to the fuzzy subsets of Y. The so-called extension principle is used to define
the set-function $f : F(2^X) \to F(2^Y)$, where $F(2^X)$ is the fuzzy power set of X, the set
of all fuzzy subsets of X. The formal definition of the extension principle is complicated.
The key idea is a supremum of pairwise minima. Unfortunately, the extension principle
achieves generality at the price of triviality. One can show [Kosko, 1986a-87] that in general
the extension principle extends functions to fuzzy sets by stripping the fuzzy sets of their
fuzziness, mapping the fuzzy sets into bit vectors of nearly all 1s. This shortcoming,
combined  with  the  tendency  of  the  extension-principle  framework  to  push  fuzzy  theory 
into  largely  inaccessible  regions  of  abstract  mathematics,  led  in  part  to  the  development 
of  the  alternative  sets-as-points  geometric  framework  of  fuzzy  theory.) 

We shall focus on fuzzy systems $S : I^n \to I^p$ that map balls of fuzzy sets in $I^n$ to
balls of fuzzy sets in $I^p$. These continuous fuzzy systems behave as associative memories.
They map close inputs to close outputs. We shall refer to them as fuzzy associative
memories, or FAMs.

The simplest FAM encodes the FAM rule or association $(A_i, B_i)$, which associates
the p-dimensional fuzzy set $B_i$ with the n-dimensional fuzzy set $A_i$. These minimal FAMs
essentially map one ball in $I^n$ to one ball in $I^p$. They are comparable to simple neural
networks. But the minimal FAMs need not be adaptively trained. As discussed below,
structured  knowledge  of  the  form  “If  traffic  is  heavy  in  this  direction,  then  keep  the  stop 
light  green  longer”  can  be  directly  encoded  in  a  Hebbian-style  FAM  matrix.  In  practice 
we  can  eliminate  even  this  matrix.  In  its  place  the  user  encodes  the  fuzzy-set  association 
(HEAVY,  LONGER)  as  a  single  linguistic  entry  in  a  FAM  bank  matrix. 

In general a FAM system $F : I^n \to I^p$ encodes and processes in parallel a FAM
bank of m FAM rules $(A_1, B_1), \ldots, (A_m, B_m)$. Each input A to the FAM system activates
each stored FAM rule to a different degree. The minimal FAM that stores $(A_i, B_i)$ maps
input A to $B_i'$, a partially activated version of $B_i$. The more A resembles $A_i$, the more $B_i'$
resembles $B_i$. The corresponding output fuzzy set B combines these partially activated
fuzzy sets $B_1', \ldots, B_m'$. In the simplest case B is a weighted average of the partially activated
sets:

$$B = w_1 B_1' + \cdots + w_m B_m' ,$$

where $w_i$ reflects the credibility, frequency, or strength of the fuzzy association $(A_i, B_i)$. In
practice we usually “defuzzify” the output waveform B to a single numerical value $y_j$ in Y
by computing the fuzzy centroid of B with respect to the output universe of discourse Y.
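
The combination and defuzzification steps just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the centroid is taken to be the usual discrete form, the sum of $y_j m_B(y_j)$ divided by the sum of $m_B(y_j)$, and the output universe, the two partially activated sets, and the unit weights are hypothetical values chosen for the example.

```python
def combine(weights, partial_sets):
    """Component-wise weighted sum B = w_1 B'_1 + ... + w_m B'_m."""
    p = len(partial_sets[0])
    return [sum(w * b[j] for w, b in zip(weights, partial_sets)) for j in range(p)]


def centroid(y_values, fits):
    """Discrete fuzzy centroid (assumed form): sum(y_j * m_j) / sum(m_j)."""
    total = sum(fits)
    if total == 0:
        raise ValueError("empty output fuzzy set: centroid is undefined")
    return sum(y * m for y, m in zip(y_values, fits)) / total


# Hypothetical output universe: green-light durations in seconds.
y_values = [2.0, 4.0, 6.0, 8.0, 10.0]

# Hypothetical partially activated consequent sets B'_1 and B'_2.
b1_partial = [0.0, 0.2, 0.2, 0.0, 0.0]
b2_partial = [0.0, 0.0, 0.5, 0.5, 0.0]
weights = [1.0, 1.0]  # hypothetical rule weights w_1, w_2

B = combine(weights, [b1_partial, b2_partial])
y_out = centroid(y_values, B)
```

With unit weights the combined set is simply the component-wise sum of the partially activated sets, and the centroid falls between the regions the two rules favor.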

More general still, a FAM system encodes a bank of compound FAM rules that associate
multiple output or consequent fuzzy sets $B^1, \ldots, B^s$ with multiple input or antecedent fuzzy
sets $A^1, \ldots, A^r$. We can treat compound FAM rules as compound linguistic conditionals.
Structured knowledge can then be naturally, and in many cases easily, obtained. We
combine antecedent and consequent sets with logical conjunction, disjunction, or negation.
For instance, we would interpret the compound association $(A^1, A^2; B)$ linguistically as
the compound conditional “IF $X^1$ is $A^1$ AND $X^2$ is $A^2$, THEN Y is B” if the comma in
the fuzzy association $(A^1, A^2; B)$ stood for conjunction instead of, say, disjunction.
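
As a rough illustration of how the choice of connective matters, the sketch below assumes the common min, max, and complement operators for fuzzy conjunction, disjunction, and negation. The text above does not fix these operators, and the two antecedent degrees are hypothetical.

```python
def fuzzy_and(*degrees):
    """Conjunction, assumed here to be the pairwise minimum."""
    return min(degrees)


def fuzzy_or(*degrees):
    """Disjunction, assumed here to be the pairwise maximum."""
    return max(degrees)


def fuzzy_not(degree):
    """Negation, assumed here to be the complement 1 - degree."""
    return 1.0 - degree


# Hypothetical degrees to which "X1 is A1" and "X2 is A2" hold for some input.
a1_degree = 0.7
a2_degree = 0.4

activation_if_and = fuzzy_and(a1_degree, a2_degree)  # comma read as AND
activation_if_or = fuzzy_or(a1_degree, a2_degree)    # comma read as OR
```

Under these assumptions the same pair of antecedent degrees activates the consequent B to degree 0.4 when the comma means conjunction and to degree 0.7 when it means disjunction.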

We specify in advance the numerical universes of discourse $X^1$, $X^2$, and Y. For each
universe of discourse X, we specify an appropriate library of fuzzy set values, $A_1^r, \ldots, A_k^r$.
Contiguous fuzzy sets in a library overlap. In principle a neural network can estimate these
libraries of fuzzy sets. In practice this is usually unnecessary. The library sets represent
a  weighted,  though  overlapping,  quantization  of  the  input  space  X.  A  different  library  of 
fuzzy  sets  similarly  quantizes  the  output  space  Y.  Once  the  library  of  fuzzy  sets  is  defined, 
we  construct  the  FAM  by  choosing  appropriate  combinations  of  input  and  output  fuzzy 
sets.  We  can  use  adaptive  techniques  to  make,  assist,  or  modify  these  choices. 

An adaptive FAM (AFAM) is a time-varying FAM system. System parameters gradually
change as the FAM system samples and processes data. Below we discuss how neural
network algorithms can adaptively infer FAM rules from training data. In principle learning
can modify other FAM system components, such as the libraries of fuzzy sets or the
FAM-rule weights $w_i$.

Below  we  propose  and  illustrate  an  unsupervised  adaptive  clustering  scheme,  based  on 
competitive  learning,  for  “blindly”  generating  and  refining  the  bank  of  FAM  rules.  In  some 
cases  we  can  use  supervised  learning  techniques,  though  we  need  additional  information 
to  accurately  generate  error  estimates. 

FUZZY  AND  NEURAL  FUNCTION  ESTIMATORS 

Neural and fuzzy systems estimate sampled functions and behave as associative memories.
They share a key advantage over traditional statistical-estimation and adaptive-
control  approaches  to  function  estimation.  They  are  model-free  estimators.  Neural  and 
fuzzy  systems  estimate  a  function  without  requiring  a  mathematical  description  of  how  the 
output  functionally  depends  on  the  input.  They  “learn  from  example.”  More  precisely, 
they  learn  from  samples. 

Both  approaches  are  numerical,  can  be  partially  described  with  theorems,  and  admit  an 
algorithmic characterization that favors silicon and optical implementation. These properties
distinguish neural and fuzzy approaches from the symbolic processing approaches of
artificial  intelligence. 

Neural  and  fuzzy  systems  differ  in  how  they  estimate  sampled  functions.  They  differ 
in the kind of samples used, how they represent and store those samples, and how they
associatively “inference” or map inputs to outputs.

These  differences  appear  during  system  construction.  The  neural  approach  requires 
the  specification  of  a  nonlinear  dynamical  system,  usually  feedforward,  the  acquisition  of 
a  sufficiently  representative  set  of  numerical  training  samples,  and  the  encoding  of  those 
training  samples  in  the  dynamical  system  by  repeated  learning  cycles.  The  fuzzy  system 
requires  only  that  a  linguistic  “rule  matrix”  be  partially  filled  in.  This  task  is  markedly 
simpler  than  designing  and  training  a  neural  network.  Once  we  construct  the  systems,  we 
can  present  the  same  numerical  inputs  to  either  system.  The  outputs  will  be  in  the  same 
numerical  space  of  alternatives.  So  both  systems  correspond  to  a  surface  or  manifold  in 
the input-output product space $X \times Y$. We present examples of these surfaces in Chapters
18  and  19. 

Which  system,  neural  or  fuzzy,  is  more  appropriate  for  a  particular  problem  depends  on 
the  nature  of  the  problem  and  the  availability  of  numerical  and  structured  data.  To  date 
fuzzy  techniques  have  been  most  successfully  applied  to  control  problems.  These  problems 
often  permit  comparison  with  standard  control-theoretic  and  expert-system  approaches. 
Neural networks so far seem best applied to ill-defined two-class pattern recognition problems
(defective or nondefective, bomb or not, etc.). The application of both approaches to
new  problem  areas  is  just  beginning,  amid  varying  amounts  of  enthusiasm  and  scepticism. 

Fuzzy systems estimate functions with fuzzy set samples $(A_i, B_i)$. Neural systems use
numerical point samples $(x_i, y_i)$. Both kinds of samples are from the input-output product
space $X \times Y$. Figure 17.1 illustrates the geometry of fuzzy-set and numerical-point samples
taken from the function $f : X \to Y$.

The fuzzy-set association $(A_i, B_i)$ is sometimes called a “rule.” This is misleading
since reasoning with sets is not the same as reasoning with propositions. Reasoning with
sets is harder. Sets are multidimensional, and associations are housed in matrices, not
conditionals. We must take care how we define each term and operation. We shall refer to
the antecedent term $A_i$ in the fuzzy association $(A_i, B_i)$ as the input associant and the
consequent term $B_i$ as the output associant.


FIGURE 17.1 Function f maps domain X to range Y. In the first illustration
we use several numerical point samples $(x_i, y_i)$ to estimate $f : X \to Y$.
In the second case we use only a few fuzzy subsets $A_i$ of X and $B_i$ of Y. The
fuzzy association $(A_i, B_i)$ represents system structure, as an adaptive clustering
algorithm might infer or as an expert might articulate. In practice there are
usually fewer different output associants or “rule” consequents $B_i$ than input
associants or antecedents $A_i$.


The fuzzy-set sample $(A_i, B_i)$ encodes structure. It represents a mapping itself, a minimal
fuzzy association of part of the output space with part of the input space. In practice
this resembles a meta-rule, IF $A_i$, THEN $B_i$, the type of structured linguistic rule an expert
might articulate to build an expert-system “knowledge base.” The association might
also be the result of an adaptive clustering algorithm.

Consider  a  fuzzy  association  that  might  be  used  in  the  intelligent  control  of  a  traffic 
light:  “If  the  traffic  is  heavy  in  this  direction,  then  keep  the  light  green  longer.”  The 
fuzzy  association  is  (HEAVY,  LONGER).  Another  fuzzy  association  might  be  (LIGHT, 
SHORTER).  The  fuzzy  system  encodes  each  linguistic  association  or  “rule”  in  a  numerical 
fuzzy associative memory (FAM) mapping. The FAM then numerically processes numerical
input  data.  A  measured  description  of  traffic  density  (e.g.,  150  cars  per  unit  road  surface 
area)  then  corresponds  to  a  unique  numerical  output  (e.g.,  3  seconds),  the  “recalled” 
output. 

The  degree  to  which  a  particular  measurement  of  traffic  density  is  heavy  depends  on 
how  we  define  the  fuzzy  set  of  heavy  traffic.  The  definition  may  be  obtained  from  statistical 
or  neural  clustering  of  historical  data  or  from  pooling  the  responses  of  experts.  In  practice 
the  fuzzy  engineer  and  the  problem  domain  expert  agree  on  one  of  many  possible  libraries 
of  fuzzy  set  definitions  for  the  variables  in  question. 

The  degree  to  which  the  traffic  light  is  kept  green  longer  depends  on  the  degree  to 
which the measurement is heavy. In the simplest case the two degrees are the same. In
general  they  differ.  In  actual  fuzzy  systems  the  output  control  variables — in  this  case  the 
single  variable  green  light  duration — depend  on  many  FAM  rule  antecedents  or  associants 
that  are  activated  to  different  degrees  by  incoming  data. 


Neural  vs.  Fuzzy  Representation  of  Structured  Knowledge 


The functional distinction between fuzzy and neural systems begins with
how they represent structured knowledge. How would a neural network encode the same
associative  information?  How  would  a  neural  network  encode  the  structured  knowledge 
“If  the  traffic  is  heavy  in  this  direction,  then  keep  the  light  green  longer”? 

The simplest method is to encode two associated numerical vectors. One vector represents
the input associant HEAVY. The other vector represents the output associant
LONGER. But this is too simple, for the neural network’s fault tolerance now works
to its disadvantage. The network tends to reconstruct partial inputs to complete sample
inputs. It erases the desired partial degrees of activation. If an input is close to $A_i$, the
output will tend to be $B_i$. If the input is distant from $A_i$, the output will tend to be some
other sampled output vector or a spurious output altogether.

A  better  neural  approach  is  to  encode  a  mapping  from  the  heavy-traffic  subspace  to 
the  longer-time  subspace.  Then  the  neural  network  needs  a  representative  sample  set  to 
capture  this  structure.  Statistical  networks,  such  as  adaptive  vector  quantizers,  may  need 
thousands  of  statistically  representative  samples.  Feedforward  multi-layer  neural  networks 
trained  with  the  backpropagation  algorithm  may  need  hundreds  of  representative  numerical 
input-output  pairs  and  may  need  to  recycle  these  samples  tens  of  thousands  of  times  in 
the  learning  process. 

The  neural  approach  suffers  a  deeper  problem  than  just  the  computational  burden  of 
training. What does it encode? How do we know the network encodes the original structure?
What does it recall? There is no natural inferential audit trail. System nonlinearities
wash  it  away.  Unlike  an  expert  system,  we  do  not  know  which  inferential  paths  the  network 
uses  to  reach  a  given  output  or  even  which  inferential  paths  exist.  There  is  only  a  system  of 
synchronous  or  asynchronous  nonlinear  functions.  Unlike,  say,  the  adaptive  Kalman  filter, 
we  cannot  appeal  to  a  postulated  mathematical  model  of  how  the  output  state  depends  on 
the  input  state.  Model-free  estimation  is,  after  all,  the  central  computational  advantage 
of  neural  networks.  The  cost  is  system  inscrutability. 


We  are  left  with  an  unstructured  computational  black  box.  We  do  not  know  what  the 
neural  network  encoded  during  training  or  what  it  will  encode  or  forget  in  further  training. 
(For competitive adaptive vector quantizers we do know that sample-space centroids are
asymptotically estimated.) We can characterize the neural network’s behavior only by
exhaustively  passing  all  inputs  through  the  black  box  and  recording  the  recalled  outputs. 
The  characterization  may  be  in  terms  of  a  summary  scalar  like  mean-squared  error. 

This  black-box  characterization  of  the  network’s  behavior  involves  a  computational 
dilemma.  On  the  one  hand,  for  most  problems  the  number  of  input-output  cases  we  need 
to  check  is  computationally  prohibitive.  On  the  other,  when  the  number  of  input-output 
cases  is  tractable,  we  may  as  well  store  these  pairs  and  appeal  to  them  directly,  and  without 
error,  as  a  look-up  table.  In  the  first  case  the  neural  network  is  unreliable.  In  the  second 
case  it  is  unnecessary. 

A  further  problem  is  sample  generation.  Where  did  the  original  numerical  point  samples 
come  from?  Was  an  expert  asked  to  give  numbers?  How  reliable  are  such  numerical  vectors, 
especially  when  the  expert  feels  most  comfortable  giving  the  original  linguistic  data?  This 
procedure  seems  at  most  as  reliable  as  the  expert-system  method  of  asking  an  expert  to 
give  condition-action  rules  with  numerical  uncertainty  weights. 

Statistical  neural  estimators  require  a  “statistically  representative”  sample  set.  We  may 
need to randomly “create” these samples from an initial small sample set by bootstrap techniques
or by random-number generation of points clustered near the original samples. Both
sample-augmentation  procedures  assume  that  the  initial  sample  set  sufficiently  represents 
the  underlying  probability  distribution.  The  problem  of  where  the  original  sample  set 
comes  from  remains.  The  fuzziness  of  the  notion  “statistically  representative”  compounds 
the  problem.  In  general  we  do  not  know  in  advance  how  well  a  given  sample  set  reflects  an 
unknown  underlying  distribution  of  points.  Indeed  when  the  network  is  adapting  on-line, 
we  know  only  past  samples.  The  remainder  of  the  sample  set  is  in  the  unsampled  future. 

In  contrast,  fuzzy  systems  directly  encode  the  linguistic  sample  (HEAVY,  LONGER)  in 
a dedicated numerical matrix. The default encoding technique is the fuzzy Hebb procedure
discussed below. For practical problems, as mentioned above, the numerical matrix need
not be stored. Indeed it need not even be formed. Certain numerical inputs permit this
simplification, as we shall see below. In general we describe inputs by an uncertainty
distribution, probabilistic or fuzzy. Then we must use the entire matrix.

For  instance,  if  a  heavy  traffic  input  is  simply  the  number  150,  we  can  omit  the  FAM 
matrix.  But  if  the  input  is  a  Gaussian  curve  with  mean  150,  then  in  principle  we  must 
process  the  vector  input  with  a  FAM  matrix.  (In  practice  we  might  use  only  the  mean.) 
This  difference  is  explained  below.  The  dimensions  of  the  linguistic  FAM  bank  matrix 
are  usually  small.  The  dimensions  reflect  the  quantization  levels  of  the  input  and  output 
spaces. 

The  fuzzy  approach  combines  the  purely  numerical  approaches  of  neural  networks  and 
mathematical modeling with the symbolic, structure-rich approaches of artificial intelligence.
We acquire knowledge symbolically (or numerically if we use adaptive techniques)
but represent it numerically. We also process data numerically. Adaptive FAM rules
correspond to common-sense, often non-articulated, behavioral rules that improve with
experience. 

We  can  acquire  structured  expertise  in  the  fuzzy  terminology  of  the  knowledge  source, 
the  “expert.”  This  requires  little  or  no  force-fitting.  Such  is  the  expressive  power  of 
fuzziness.  Yet  in  the  numerical  domain  we  can  prove  theorems  and  design  hardware. 

This  approach  does  not  abandon  neural  network  techniques.  Instead,  it  limits  them  to 
unstructured  parameter  and  state  estimation,  pattern  recognition,  and  cluster  formation. 
The  system  architecture  remains  fuzzy,  though  perhaps  adaptively  so.  In  the  same  spirit, 
no  one  believes  that  the  brain  is  a  single  unstructured  neural  network. 


FAMs as Mappings

Fuzzy associative memories (FAMs) are transformations. FAMs map fuzzy sets
to fuzzy sets. They map unit cubes to unit cubes. This is evident in Figure 17.1. In
the simplest case the FAM consists of a single association, such as (HEAVY, LONGER).
In general the FAM consists of a bank of different FAM associations. Each association
is represented by a different numerical FAM matrix, or a different entry in a FAM-bank
matrix. These matrices are not combined as with neural network associative memory
(outer-product) matrices. (An exception is the fuzzy cognitive map [Kosko, 1988; Taber,
1987, 1990].) The matrices are stored separately but accessed in parallel.

We begin with single-association FAMs. For concreteness let the fuzzy-set pair (A, B)
encode the traffic-control association (HEAVY, LONGER). We quantize the domain of traffic
density to the n numerical variables $x_1, x_2, \ldots, x_n$. We quantize the range of green-light
duration to the p variables $y_1, y_2, \ldots, y_p$. The elements $x_i$ and $y_j$ belong respectively to
the ground sets $X = \{x_1, \ldots, x_n\}$ and $Y = \{y_1, \ldots, y_p\}$. $x_1$ might represent zero
traffic density. $y_p$ might represent 10 seconds.

The fuzzy sets A and B are fuzzy subsets of X and Y. So A is a point in the n-dimensional
unit hypercube $I^n = [0, 1]^n$, and B is a point in the p-dimensional fuzzy
cube $I^p$. Equivalently, we can think of A and B as membership functions $m_A$ and $m_B$
mapping the elements $x_i$ of X and $y_j$ of Y to degrees of membership in [0, 1]. The
membership values, or fit (fuzzy unit) values, indicate how much $x_i$ belongs to or fits in
subset A, and how much $y_j$ belongs to B. We describe this with the abstract functions
$m_A : X \to [0, 1]$ and $m_B : Y \to [0, 1]$. We shall freely view sets both as functions
and as points.

The geometric sets-as-points interpretation of fuzzy sets A and B as points in unit
cubes allows a natural vector representation. We represent A and B by the numerical fit
vectors $A = (a_1, \ldots, a_n)$ and $B = (b_1, \ldots, b_p)$, where $a_i = m_A(x_i)$ and $b_j = m_B(y_j)$.
We can interpret the identifications A = HEAVY and B = LONGER to suit the problem
at hand. Intuitively the $a_i$ values should increase as the index i increases, perhaps approximating
a sigmoid membership function. Figure 17.2 illustrates three possible fuzzy
subsets of the universe of discourse X.
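
To make the fit-vector representation concrete, the following sketch builds one possible fit vector for HEAVY by sampling a sigmoid membership function on a quantized traffic-density axis. The five quantization levels, the midpoint, and the steepness are hypothetical choices for illustration, not values taken from the text.

```python
import math


def sigmoid_membership(x, midpoint, steepness):
    """Hypothetical sigmoid membership function m_A(x) with values in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-steepness * (x - midpoint)))


# Hypothetical quantization x_1, ..., x_n of traffic density (n = 5).
x_values = [0, 50, 100, 150, 200]

# Fit vector A = (a_1, ..., a_n) with a_i = m_A(x_i): a point in the unit cube I^n.
heavy = [sigmoid_membership(x, midpoint=100.0, steepness=0.05) for x in x_values]
```

The resulting fit values increase with the index i, as the paragraph above suggests they should for a set such as HEAVY.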


FIGURE 17.2 Three possible fuzzy subsets of traffic density space X. Each
fuzzy sample corresponds to such a subset. We draw the fuzzy sets as continuous
membership functions. In practice membership values are quantized. So
the sets are points in the unit hypercube $I^n$. (Horizontal axis: traffic density,
from $x_1 = 0$ to $x_n = 200$, with ticks at 50, 100, and 150.)

Fuzzy  Vector-Matrix  Multiplication:  Max-Min  Composition 

Fuzzy  vector-matrix  multiplication  is  similar  to  classical  vector-matrix  multiplication. 
We  replace  pairwise  multiplications  with  pairwise  minima.  We  replace  column  (row)  sums 
with  column  (row)  maxima.  We  denote  this  fuzzy  vector-matrix  composition  relation, 
or the max-min composition relation [Klir, 1988], by the composition operator “$\circ$”. For
row fit vectors A and B and fuzzy n-by-p matrix M (a point in $I^{n \times p}$):

$$A \circ M = B , \qquad (1)$$

where we compute the “recalled” component $b_j$ by taking the fuzzy inner product of fit
vector A with the jth column of M:

$$b_j = \max_{1 \le i \le n} \min(a_i, m_{ij}) . \qquad (2)$$

Suppose we compose the fit vector A = (.3 .4 .8 1) with the fuzzy matrix M given by

$$M = \begin{pmatrix} .2 & .8 & .7 \\ .7 & .6 & .6 \\ .8 & .1 & .5 \\ 0 & .2 & .3 \end{pmatrix} .$$

Then we compute the “recalled” fit vector $B = A \circ M$ component-wise as

$$\begin{aligned}
b_1 &= \max\{\min(.3, .2), \min(.4, .7), \min(.8, .8), \min(1, 0)\} \\
    &= \max(.2, .4, .8, 0) \\
    &= .8 , \\
b_2 &= \max(.3, .4, .1, .2) \\
    &= .4 , \\
b_3 &= \max(.3, .4, .5, .3) \\
    &= .5 .
\end{aligned}$$

So B = (.8 .4 .5). If we somehow encoded (A, B) in the FAM matrix M, we would say
that the FAM system exhibits perfect recall in the forward direction.
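
The worked example above can be checked mechanically. The sketch below is one straightforward way to write max-min composition (2) in plain Python; the function name `max_min_compose` is our own, and the vectors and matrix are the ones from the example.

```python
def max_min_compose(a, m):
    """Max-min composition: b_j = max over i of min(a_i, m_ij)."""
    n = len(a)
    p = len(m[0])
    return [max(min(a[i], m[i][j]) for i in range(n)) for j in range(p)]


A = [0.3, 0.4, 0.8, 1.0]
M = [
    [0.2, 0.8, 0.7],
    [0.7, 0.6, 0.6],
    [0.8, 0.1, 0.5],
    [0.0, 0.2, 0.3],
]

B = max_min_compose(A, M)
```

Because minimum and maximum only select among the stored fit values, the recalled vector reproduces B = (.8 .4 .5) exactly, with no rounding.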

The neural interpretation of max-min composition is that each neuron in field $F_Y$
(or field $F_B$) generates its signal/activation value by fuzzy linear composition. Passing
information back through $M^T$ allows us to interpret the fuzzy system as a bidirectional associative
memory (BAM). The Bidirectional FAM Theorems below characterize successful
BAM recall for fuzzy correlation or Hebbian learning.

For  completeness  we  also  mention  the  max-product  composition  operator,  which 
replaces  minimum  with  product  in  (2): 

$$b_j = \max_{1 \le i \le n} a_i \, m_{ij} .$$

In the fuzzy literature this composition operator is often confused with the fuzzy correlation
encoding scheme discussed below. Max-product composition is a method for “multiplying”
fuzzy matrices or vectors. Fuzzy correlation, which also uses pairwise products of
fit  values,  is  a  method  for  constructing  fuzzy  matrices.  In  practice,  and  in  the  following 
discussion,  we  use  only  max-min  composition. 
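
For comparison only, max-product composition can be sketched the same way. The function name is our own, and the inputs reuse the fit vector A and matrix M from the max-min example purely as an illustration.

```python
def max_product_compose(a, m):
    """Max-product composition: b_j = max over i of a_i * m_ij."""
    n = len(a)
    p = len(m[0])
    return [max(a[i] * m[i][j] for i in range(n)) for j in range(p)]


A = [0.3, 0.4, 0.8, 1.0]
M = [
    [0.2, 0.8, 0.7],
    [0.7, 0.6, 0.6],
    [0.8, 0.1, 0.5],
    [0.0, 0.2, 0.3],
]

B_product = max_product_compose(A, M)
```

On these inputs the max-product result is roughly (.64 .24 .4), component-wise no larger than the max-min result (.8 .4 .5), since a product of two fit values never exceeds their minimum.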


FUZZY  HEBB  FAMs 

Most  fuzzy  systems  found  in  applications  are  fuzzy  Hebb  FAMs  [Kosko,  1986b].  They 
are fuzzy systems $S : I^n \to I^p$ constructed in a simple neural-like manner. As discussed
in Chapter 4, in neural network theory we interpret the classical Hebbian hypothesis of
correlation synaptic learning [Hebb, 1949] as unsupervised learning with the signal product
$S_i S_j$:

$$\dot{m}_{ij} = -m_{ij} + S_i(x_i) \, S_j(y_j) . \qquad (3)$$

For a given pair of bipolar vectors (X, Y), the neural interpretation gives the outer-product
correlation matrix

$$M = X^T Y . \qquad (4)$$

The fuzzy Hebb matrix is similarly defined pointwise by the minimum of the “signals”
$a_i$ and $b_j$, an encoding scheme we shall call correlation-minimum encoding:

$$m_{ij} = \min(a_i, b_j) , \qquad (5)$$

given in matrix notation as the fuzzy outer-product

$$M = A^T \circ B . \qquad (6)$$

Mamdani [1977] and Togai [1986] independently arrived at the fuzzy Hebbian prescription
(5) as a multi-valued logical-implication operator: $\mathrm{truth}(a_i \to b_j) = \min(a_i, b_j)$.
The min operator, though, is a symmetric truth operator. So it does not properly generalize
the classical implication $P \to Q$, which is false if and only if the antecedent P
is true and the consequent Q is false, t(P) = 1 and t(Q) = 0. In contrast, a like desire
to define a “conditional possibility” matrix pointwise with continuous implication values
led Zadeh [1983] to choose the Lukasiewicz implication operator:
$m_{ij} = \mathrm{truth}(a_i \to b_j) = \min(1, 1 - a_i + b_j)$. The problem with the Lukasiewicz operator
is that it usually equals unity, for $\min(1, 1 - a_i + b_j) < 1$ iff $a_i > b_j$. Most entries of the
resulting matrix M are unity or near unity. This ignores the information in the association (A, B).
So $A' \circ M$ tends to equal the largest fit value $a_k'$ for any system input $A'$.
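
A small numerical experiment illustrates the point about the Lukasiewicz operator. The sketch below builds the Lukasiewicz matrix for the running pair A = (.3 .4 .8 1), B = (.8 .4 .5) and counts its unity entries; the helper names and the tolerance used to detect unity are our own choices.

```python
def lukasiewicz_matrix(a, b):
    """Pointwise Lukasiewicz implication: m_ij = min(1, 1 - a_i + b_j)."""
    return [[min(1.0, 1.0 - ai + bj) for bj in b] for ai in a]


def max_min_compose(a, m):
    """Max-min composition: b_j = max over i of min(a_i, m_ij)."""
    return [max(min(a[i], m[i][j]) for i in range(len(a))) for j in range(len(m[0]))]


A = [0.3, 0.4, 0.8, 1.0]
B = [0.8, 0.4, 0.5]

M_luk = lukasiewicz_matrix(A, B)

# Count entries equal to unity, up to a small floating-point tolerance.
unity_entries = sum(1 for row in M_luk for m in row if m >= 1.0 - 1e-12)

recalled = max_min_compose(A, M_luk)
```

Here more than half of the twelve entries are unity, and composing the encoded input A with the matrix does not return B: the recalled vector is approximately (.8 .6 .7).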

We construct an autoassociative fuzzy Hebb FAM matrix by encoding the redundant
pair (A, A) in (6), as the fuzzy auto-correlation matrix:

$$M = A^T \circ A . \qquad (7)$$

In the previous example the matrix M was such that the input A = (.3 .4 .8 1)
recalled fit vector B = (.8 .4 .5) upon max-min composition: $A \circ M = B$. Will
B still be recalled if we replace the original matrix M with the fuzzy Hebb matrix found
with (6)? Substituting A and B in (6) gives

$$M = A^T \circ B = \begin{pmatrix} .3 \\ .4 \\ .8 \\ 1 \end{pmatrix} \circ (.8 \;\; .4 \;\; .5) = \begin{pmatrix} .3 & .3 & .3 \\ .4 & .4 & .4 \\ .8 & .4 & .5 \\ .8 & .4 & .5 \end{pmatrix} .$$

This fuzzy Hebb matrix M illustrates two key properties. First, the ith row of M is
the pairwise minimum of $a_i$ and the output associant B. Symmetrically, the jth column
of M is the pairwise minimum of $b_j$ and the input associant A:

$$M = \begin{pmatrix} a_1 \wedge B \\ \vdots \\ a_n \wedge B \end{pmatrix} \qquad (8)$$

$$\;\; = \left[\, b_1 \wedge A^T \mid \cdots \mid b_p \wedge A^T \,\right] , \qquad (9)$$

where the cap operator denotes pairwise minimum: $a_i \wedge b_j = \min(a_i, b_j)$. The term
$a_i \wedge B$ indicates component-wise minimum:

$$a_i \wedge B = (a_i \wedge b_1, \ldots, a_i \wedge b_p) . \qquad (10)$$

Hence if some $a_k = 1$, then the kth row of M is B. If some $b_l = 1$, the lth column of
M is A. More generally, if some $a_k$ is at least as large as every $b_j$, then the kth row of the
fuzzy Hebb matrix M is B.

Second, the third and fourth rows of M are just the fit vector B. Yet no column
is A. This allows perfect recall in the forward direction, $A \circ M = B$, but not in the
backward direction, $B \circ M^T \ne A$:

$$A \circ M = (.8 \;\; .4 \;\; .5) = B ,$$

$$B \circ M^T = (.3 \;\; .4 \;\; .8 \;\; .8) = A' \subset A .$$

$A'$ is a proper subset of A: $A' \ne A$ and $S(A', A) = 1$, where S measures the degree of
subsethood of $A'$ in A, as discussed in Chapter 16. In other words, $a_i' \le a_i$ for each i and
$a_k' < a_k$ for at least one k. The Bidirectional FAM Theorems below show that this is a
general property: If $B' = A' \circ M$ differs from B, then $B'$ is a proper subset of B. Hence
fuzzy subsets truly map to fuzzy subsets.
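
The encoding and both recall directions can be reproduced with a short sketch. The helper names are our own; the code simply applies correlation-minimum encoding (5) and max-min composition (2) to the running example.

```python
def correlation_minimum(a, b):
    """Fuzzy Hebb matrix M = A^T o B with m_ij = min(a_i, b_j)."""
    return [[min(ai, bj) for bj in b] for ai in a]


def transpose(m):
    """Matrix transpose, used for backward recall through M^T."""
    return [list(column) for column in zip(*m)]


def max_min_compose(a, m):
    """Max-min composition: b_j = max over i of min(a_i, m_ij)."""
    return [max(min(a[i], m[i][j]) for i in range(len(a))) for j in range(len(m[0]))]


A = [0.3, 0.4, 0.8, 1.0]
B = [0.8, 0.4, 0.5]

M = correlation_minimum(A, B)
forward = max_min_compose(A, M)              # A o M
backward = max_min_compose(B, transpose(M))  # B o M^T

# A' is a subset of A when every fit value of A' is at most that of A.
backward_is_subset = all(x <= y for x, y in zip(backward, A))
```

Forward recall returns B exactly, while backward recall returns (.3 .4 .8 .8), which differs from A only in the last component and never exceeds it.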




The Bidirectional FAM Theorem for Correlation-Minimum Encoding


Analysis of FAM recall uses the traditional [Klir, 1988] fuzzy set notions of the height
and the normality of fuzzy sets. The height H(A) of fuzzy set A is the maximum fit value
of A:

$$H(A) = \max_{1 \le i \le n} a_i .$$

A fuzzy set is normal if H(A) = 1, if at least one fit value $a_k$ is maximal: $a_k = 1$. In
practice fuzzy sets are usually normal. We can extend a nonnormal fuzzy set to a normal
fuzzy set by adding a dummy dimension with corresponding fit value $a_{n+1} = 1$.

Recall accuracy in fuzzy Hebb FAMs constructed with correlation-minimum encoding
depends on the heights H(A) and H(B). Normal fuzzy sets exhibit perfect recall. Indeed
(A, B) is a bidirectional fixed point, with $A \circ M = B$ and $B \circ M^T = A$, if and only if
H(A) = H(B), which always holds if A and B are normal. This is the content of the
Bidirectional FAM Theorem [Kosko, 1986a] for correlation-minimum encoding. Below we
present a similar theorem for correlation-product encoding.


Correlation-Minimum Bidirectional FAM Theorem. If M = A^T o B, then

(i) A o M = B iff H(A) ≥ H(B) ,

(ii) B o M^T = A iff H(B) ≥ H(A) ,

(iii) A' o M ⊂ B for any A' ,

(iv) B' o M^T ⊂ A for any B' .

Proof. Observe that the height H(A) is the fuzzy norm of A:

\[ A \circ A^T = \max_i \, (a_i \wedge a_i) = \max_i \, a_i = H(A) . \]

Then

\[ \begin{aligned}
A \circ M &= A \circ (A^T \circ B) \\
          &= (A \circ A^T) \circ B \\
          &= H(A) \circ B \\
          &= H(A) \wedge B .
\end{aligned} \]

So H(A) ∧ B = B iff H(A) ≥ H(B), establishing (i). Now suppose A' is an arbitrary
fit vector in I^n. Then

\[ \begin{aligned}
A' \circ M &= (A' \circ A^T) \circ B \\
           &= (A' \circ A^T) \wedge B ,
\end{aligned} \]

which establishes (iii). A similar argument using M^T = B^T o A establishes (ii) and (iv).

Q.E.D.


The equality A o A^T = H(A) implies an immediate corollary of the Bidirectional
FAM Theorem. Supersets A' ⊃ A behave the same as the encoded input associant
A: A' o M = B if A o M = B. Fuzzy Hebb FAMs ignore the information in the
difference A' − A when A ⊂ A'.
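
The theorem and its corollary can be checked numerically on the running example. The
following is a minimal sketch in plain Python, not an implementation from the text: the
function names and the superset A_super are illustrative assumptions, while A and B are the
fit vectors used above.

```python
# Sketch: correlation-minimum encoding and max-min recall on the chapter's example.

def corr_min_encode(a, b):
    """Fuzzy Hebb matrix M = A^T o B with entries m_ij = min(a_i, b_j)."""
    return [[min(a_i, b_j) for b_j in b] for a_i in a]


def max_min_compose(a, m):
    """Max-min composition A o M: component j is max_i min(a_i, m_ij)."""
    return [max(min(a_i, row[j]) for a_i, row in zip(a, m))
            for j in range(len(m[0]))]


def transpose(m):
    """Matrix transpose for a list-of-rows matrix."""
    return [list(col) for col in zip(*m)]


def height(a):
    """Height H(A): the largest fit value of A."""
    return max(a)


A = [0.3, 0.4, 0.8, 1.0]
B = [0.8, 0.4, 0.5]
M = corr_min_encode(A, B)

forward = max_min_compose(A, M)               # A o M
backward = max_min_compose(B, transpose(M))   # B o M^T

# Hypothetical superset A' of A (every fit value at least as large as in A).
A_super = [0.5, 0.4, 0.9, 1.0]
super_recall = max_min_compose(A_super, M)

print(forward, backward, super_recall)
```

Since H(A) = 1 ≥ H(B) = .8, forward recall returns B exactly, while backward recall returns
the proper subset (.3 .4 .8 .8) of A. The superset input recalls the same B as A does.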


Correlation-Product  Encoding 

An alternative fuzzy Hebbian encoding scheme is correlation-product encoding.
The standard mathematical outer product of the fit vectors A and B forms the FAM
matrix M. This is given pointwise as

\[ m_{ij} = a_i \, b_j \]

and in matrix notation as

\[ M = A^T B . \qquad (12) \]


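So the ith row of M is just the fit-scaled fuzzy set a_i B, and the jth column of M is b_j A^T:

\[ \begin{aligned}
M &= \begin{bmatrix} a_1 B \\ \vdots \\ a_n B \end{bmatrix} && (13) \\
  &= \left[\, b_1 A^T \mid \cdots \mid b_m A^T \,\right] . && (14)
\end{aligned} \]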


If A = (.3 .4 .8 1) and B = (.8 .4 .5) as above, we encode the FAM rule (A, B) with
correlation-product in the following matrix M:

\[ M = \begin{pmatrix}
.24 & .12 & .15 \\
.32 & .16 & .2 \\
.64 & .32 & .4 \\
.8  & .4  & .5
\end{pmatrix} . \]

Note that if A' = (0 0 0 1), then A' o M = B. The output associant B is recalled
to maximal degree. If A' = (1 0 0 0), then A' o M = (.24 .12 .15). The output B is
recalled only to degree .3.
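
The two recalls just described can be reproduced with a short sketch. It assumes plain Python
lists for fit vectors; the helper names are illustrative and not taken from the text.

```python
# Sketch: correlation-product encoding (outer product) and max-min recall.

def corr_product_encode(a, b):
    """FAM matrix M = A^T B with entries m_ij = a_i * b_j."""
    return [[a_i * b_j for b_j in b] for a_i in a]


def max_min_compose(a, m):
    """Max-min composition A o M: component j is max_i min(a_i, m_ij)."""
    return [max(min(a_i, row[j]) for a_i, row in zip(a, m))
            for j in range(len(m[0]))]


A = [0.3, 0.4, 0.8, 1.0]
B = [0.8, 0.4, 0.5]
M = corr_product_encode(A, B)

recall_last = max_min_compose([0.0, 0.0, 0.0, 1.0], M)   # picks the row a_4 B = B
recall_first = max_min_compose([1.0, 0.0, 0.0, 0.0], M)  # picks the row a_1 B = .3 B
recall_full = max_min_compose(A, M)                      # H(A) B = B since H(A) = 1

print(recall_last, recall_first, recall_full)
```

The binary input selects a single row of M, so the output is B scaled by the matching fit
value a_i. Products of floating-point numbers may differ from the printed decimals in the
last digit.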

Correlation-minimum encoding produces a matrix of clipped B sets. Correlation-product
encoding produces a matrix of scaled B sets. In membership function plots,
the scaled fuzzy sets a_i B all have the same shape as B. The clipped fuzzy sets a_i ∧ B
are largely flat. In this sense correlation-product encoding preserves more information
than correlation-minimum encoding, an important point in fuzzy applications when output
fuzzy sets are added together as in equation (17) below. In the fuzzy-applications
literature this often leads to the selection of correlation-product encoding.


Unfortunately, in the fuzzy-applications literature the correlation-product encoding
scheme is invariably confused with the max-product composition method of recall or
inference, as mentioned above. This confusion is so widespread it warrants formal clarification.

In practice, and in the fuzzy control applications developed in Chapters 18 and 19, the
input fuzzy set A' is a binary vector with one 1 and all other elements 0, that is, a row of the
n-by-n identity matrix. A' represents the occurrence of the crisp measurement datum x_i,
such as a traffic density value of 30. When applied to the encoded FAM rule (A, B), the
measurement value x_i activates A to degree a_i. This is part of the max-min composition
recall process, for A' o M = (A' o A^T) o B = a_i ∧ B or a_i B depending on whether
correlation-minimum or correlation-product encoding is used. We activate or “fire” the
output associant B of the “rule” to degree a_i.

Since the values a'_i are binary, a'_i m_ij = a'_i ∧ m_ij. So the max-min and max-product
composition operators coincide. We avoid this confusion by referring to both
the recall process and the correlation encoding scheme as correlation-minimum inference
when correlation-minimum encoding is combined with max-min composition, and
as correlation-product inference when correlation-product encoding is combined with
max-min composition.
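
A small check makes the coincidence concrete. The sketch below is illustrative only: the
function names are assumptions, M is the correlation-product matrix from the example above,
and the non-binary input is a hypothetical fit vector chosen to show that the two operators
differ in general.

```python
# Sketch: max-min and max-product composition agree on binary inputs.

def max_min_compose(a, m):
    """Component j is max_i min(a_i, m_ij)."""
    return [max(min(a_i, row[j]) for a_i, row in zip(a, m))
            for j in range(len(m[0]))]


def max_product_compose(a, m):
    """Component j is max_i (a_i * m_ij)."""
    return [max(a_i * row[j] for a_i, row in zip(a, m))
            for j in range(len(m[0]))]


M = [[0.24, 0.12, 0.15],
     [0.32, 0.16, 0.2],
     [0.64, 0.32, 0.4],
     [0.8, 0.4, 0.5]]

# Rows of the 4-by-4 identity matrix model exact (crisp) measurements.
binary_inputs = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
same_for_binary = all(max_min_compose(x, M) == max_product_compose(x, M)
                      for x in binary_inputs)

# Hypothetical non-binary input: the two operators now give different outputs.
fuzzy_input = [0.3, 0.4, 0.8, 0.0]
min_out = max_min_compose(fuzzy_input, M)
prod_out = max_product_compose(fuzzy_input, M)

print(same_for_binary, min_out, prod_out)
```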

We  now  prove  the  correlation-product  version  of  the  Bidirectional  FAM  Theorem. 

Correlation-Product Bidirectional FAM Theorem. If M = A^T B and A and B
are non-null fit vectors, then

(i) A o M = B iff H(A) = 1 ,

(ii) B o M^T = A iff H(B) = 1 ,

(iii) A' o M ⊂ B for any A' ,

(iv) B' o M^T ⊂ A for any B' .

Proof.

\[ \begin{aligned}
A \circ M &= A \circ (A^T B) \\
          &= (A \circ A^T) \, B \\
          &= H(A) \, B .
\end{aligned} \]

Since B is not the empty set, H(A) B = B iff H(A) = 1, establishing (i). (A o M = B
holds trivially if B is the empty set.) For an arbitrary fit vector A' in I^n:

\[ \begin{aligned}
A' \circ M &= (A' \circ A^T) \, B \\
           &\subset H(A) \, B \\
           &\subset B ,
\end{aligned} \]

since A' o A^T ≤ H(A), establishing (iii). (ii) and (iv) are proved similarly using
M^T = B^T A. Q.E.D.


Superimposing  FAM  Rules 


Now suppose we have m FAM rules or associations (A_1, B_1), ..., (A_m, B_m). The fuzzy
Hebb encoding scheme (6) leads to m FAM matrices M_1, ..., M_m to encode the associations.
The natural neural-network temptation is to add, or in this case maximum, the m
matrices pointwise to distributively encode the associations in a single matrix M:

\[ M = \max_{1 \le k \le m} M_k . \qquad (15) \]

This superimposition scheme fails for fuzzy Hebbian encoding. The superimposed result
tends to be the matrix A^T o B, where A and B are the pointwise maxima of the respective
m fit vectors A_k and B_k. We can see this from the pointwise inequality

\[ \max_{1 \le k \le m} \min(a_i^k, b_j^k) \;\le\; \min\Bigl( \max_{1 \le k \le m} a_i^k ,\; \max_{1 \le k \le m} b_j^k \Bigr) . \qquad (16) \]

Inequality (16) tends to hold with equality as m increases since all maximum terms
approach unity. We lose the information in the m associations (A_k, B_k).
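
Inequality (16) and the resulting crosstalk can be illustrated with two rules. This is a
sketch under stated assumptions: the first rule is the running example, the second rule
(A2, B2) is hypothetical, and all function names are illustrative.

```python
# Sketch: pointwise-maximum superimposition of two correlation-minimum FAM matrices.

def corr_min_encode(a, b):
    """Entries m_ij = min(a_i, b_j)."""
    return [[min(a_i, b_j) for b_j in b] for a_i in a]


def max_min_compose(a, m):
    """Component j is max_i min(a_i, m_ij)."""
    return [max(min(a_i, row[j]) for a_i, row in zip(a, m))
            for j in range(len(m[0]))]


def pointwise_max(vectors):
    """Componentwise maximum of equally long vectors."""
    return [max(values) for values in zip(*vectors)]


A1, B1 = [0.3, 0.4, 0.8, 1.0], [0.8, 0.4, 0.5]
A2, B2 = [1.0, 0.6, 0.2, 0.1], [0.2, 0.9, 1.0]   # hypothetical second rule

M1 = corr_min_encode(A1, B1)
M2 = corr_min_encode(A2, B2)
M_super = [pointwise_max(rows) for rows in zip(M1, M2)]   # equation (15)

# Right-hand side of (16): encode the pointwise maxima of the A_k and the B_k.
M_bound = corr_min_encode(pointwise_max([A1, A2]), pointwise_max([B1, B2]))
inequality_holds = all(M_super[i][j] <= M_bound[i][j]
                       for i in range(len(M_super))
                       for j in range(len(M_super[0])))

# Crosstalk: recalling with A2 from the superimposed matrix no longer yields B2.
recall_2 = max_min_compose(A2, M_super)

print(inequality_holds, recall_2)
```

With only two rules the superimposed matrix already distorts the second association: the
first component of the recalled vector rises from .2 to .4.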

The fuzzy approach to the superimposition problem is to additively superimpose the m
recalled vectors B'_k instead of the fuzzy Hebb matrices M_k. B'_k and M_k are given by

\[ \begin{aligned}
A \circ M_k &= A \circ (A_k^T \circ B_k) \\
            &= B'_k ,
\end{aligned} \]

for any fit-vector input A applied in parallel to the bank of FAM rules (A_k, B_k). This
requires separately storing the m associations (A_k, B_k), as if each association in the FAM
bank were a separate feedforward neural network.

Separate storage of FAM associations is costly but provides an “audit trail” of the
FAM inference procedure. The user can directly determine which FAM rules contributed
how much membership activation to a “concluded” output. Separate storage also provides
knowledge-base modularity. The user can add or delete FAM-structured knowledge
without disturbing stored knowledge. Both of these benefits are advantages over a pure
neural-network architecture for encoding the same associations (A_k, B_k). Of course we can
use neural networks exogenously to estimate, or even individually house, the associations
(A_k, B_k).

Separate storage of FAM rules brings out another distinction between FAM systems
and neural networks. A fit-vector input A activates all the FAM rules (A_k, B_k) in parallel
but to different degrees. If A only partially “satisfies” the antecedent associant A_k, the
consequent associant B_k is only partially activated. If A does not satisfy A_k at all, B_k does
not activate at all. B'_k is the null vector.

Neural networks behave differently. They try to reconstruct the entire association
(A_k, B_k) when stimulated with A. If A and A_k mismatch severely, a neural network will
tend to emit a non-null output B'_k, perhaps the result of the network dynamical system
falling into a “spurious” attractor in the state space. This may be desirable for metrical
classification problems. It is undesirable for inferential problems and, arguably, for
associative memory problems. When we ask an expert a question outside his field of knowledge,
in many cases it is more prudent for him to give no response than to give an educated,
though wild, guess.


Recalled  Outputs  and  “Defuzzification” 

The recalled fit-vector output B is a weighted sum of the individual recalled vectors
B'_k:

\[ B = \sum_{k=1}^{m} w_k \, B'_k , \qquad (17) \]

where the nonnegative weight w_k summarizes the credibility or strength of the kth FAM
rule (A_k, B_k). The credibility weights w_k are immediate candidates for adaptive
modification. In practice we choose w_1 = ... = w_m = 1 as a default.
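
The additive combination (17) can be sketched as follows. The sketch stores each rule
separately and recalls through correlation-minimum encoding. The second rule, the binary
input, and the function names are illustrative assumptions rather than values from the text.

```python
# Sketch: additive combination of separately recalled vectors, equation (17).

def corr_min_encode(a, b):
    """Entries m_ij = min(a_i, b_j)."""
    return [[min(a_i, b_j) for b_j in b] for a_i in a]


def max_min_compose(a, m):
    """Component j is max_i min(a_i, m_ij)."""
    return [max(min(a_i, row[j]) for a_i, row in zip(a, m))
            for j in range(len(m[0]))]


def fam_bank_recall(a, rules, weights=None):
    """Return sum_k w_k * (A o M_k) with default weights w_k = 1."""
    if weights is None:
        weights = [1.0] * len(rules)
    total = [0.0] * len(rules[0][1])
    for w, (a_k, b_k) in zip(weights, rules):
        recalled = max_min_compose(a, corr_min_encode(a_k, b_k))   # B'_k
        total = [t + w * r for t, r in zip(total, recalled)]
    return total


rules = [([0.3, 0.4, 0.8, 1.0], [0.8, 0.4, 0.5]),
         ([1.0, 0.6, 0.2, 0.1], [0.2, 0.9, 1.0])]   # second rule is hypothetical

x = [0.0, 0.0, 1.0, 0.0]   # binary input selecting the third element
B_out = fam_bank_recall(x, rules)
print(B_out)
```

Here the first rule fires to degree .8 and the second to degree .2, so the output is the
sum of .8 ∧ B_1 and .2 ∧ B_2. The sum is not normalized, so components are not guaranteed
to stay in the unit interval.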

In principle, though not in practice, the recalled fit-vector output is a normalized sum
of the B'_k fit vectors. This keeps the components of B unit-interval valued. We do not
use normalization in practice because we invariably “defuzzify” the output distribution B
to produce a single numerical output, a single value in the output universe of discourse
Y = {y_1, ..., y_p}. The information in the output waveform B resides largely in the
relative values of the membership degrees.

The simplest defuzzification scheme is to choose that element y_max that has maximal
membership in the output fuzzy set B:

\[ m_B(y_{\max}) = \max_{1 \le j \le p} m_B(y_j) . \qquad (18) \]

The popular probabilistic methods of maximum-likelihood and maximum-a-posteriori
parameter estimation motivate this maximum-membership defuzzification scheme. The
maximum-membership scheme (18) is also computationally light.

There are two fundamental problems with the maximum-membership defuzzification
scheme. First, the mode of the B distribution is not unique. This is especially troublesome
with correlation-minimum encoding, as the representation (8) shows, and somewhat less
troublesome with correlation-product encoding. Since the minimum operator clips off the
top of the B_k fit vectors, the additively combined output fit vector B tends to be flat over
many regions of the universe of discourse Y. For continuous membership functions this leads
to infinitely many modes. Even for quantized fuzzy sets, there may be many modes.

In  practice  we  can  average  multiple  modes.  For  large  FAM  banks  of  “independent” 
FAM  rules,  some  form  of  the  Central  Limit  Theorem  (whose  proof  ultimately  depends 
on  Fourier  transformability  not  probability)  tends  to  apply.  The  waveform  B  tends  to 
resemble  a  Gaussian  membership  function.  So  a  unique  mode  tends  to  emerge.  It  tends 
to  emerge  with  fewer  samples  if  we  use  correlation-product  encoding. 

Second,  the  maximum-membership  scheme  ignores  the  information  in  much  of  the 
waveform  B.  Again  correlation-minimum  encoding  compounds  the  problem.  In  practice 
B  is  often  highly  asymmetric,  even  if  it  is  unimodal.  Infinitely  many  output  distributions 
can  share  the  same  mode. 

The natural alternative is the fuzzy centroid defuzzification scheme. We directly
compute the real-valued output as a normalized convex combination of fit values, the fuzzy
centroid B̄ of fit-vector B with respect to output space Y:

\[ \bar{B} = \frac{\sum_{j=1}^{p} y_j \, m_B(y_j)}{\sum_{j=1}^{p} m_B(y_j)} . \qquad (19) \]

The fuzzy centroid is unique and uses all the information in the output distribution B. For
symmetric unimodal distributions the mode and fuzzy centroid coincide. In many cases
we must replace the discrete sums in (19) with integrals over continuously infinite spaces.
We show in Chapter 19, though, that for libraries of trapezoidal fuzzy sets we can replace
such a ratio of integrals with a ratio of simple discrete sums.
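
The contrast between the two defuzzification schemes can be seen on a small discrete
example. In this sketch the output universe ys and both output fit vectors are hypothetical,
and the function names are illustrative.

```python
# Sketch: maximum-membership defuzzification (18) versus the fuzzy centroid (19).

def max_membership(ys, b):
    """Return every element of Y that attains the maximal fit value (all modes)."""
    peak = max(b)
    return [y for y, m in zip(ys, b) if m == peak]


def fuzzy_centroid(ys, b):
    """Centroid sum_j y_j m_B(y_j) / sum_j m_B(y_j)."""
    total = sum(b)
    if total == 0:
        raise ValueError("the null fit vector has no centroid")
    return sum(y * m for y, m in zip(ys, b)) / total


ys = [-2, -1, 0, 1, 2]

# A clipped, flat-topped output: three elements tie for the mode.
b_clipped = [0.0, 0.5, 0.5, 0.5, 0.2]
modes = max_membership(ys, b_clipped)
centroid = fuzzy_centroid(ys, b_clipped)

# A symmetric unimodal output: mode and centroid coincide at 0.
b_symmetric = [0.2, 0.6, 1.0, 0.6, 0.2]
sym_modes = max_membership(ys, b_symmetric)
sym_centroid = fuzzy_centroid(ys, b_symmetric)

print(modes, centroid, sym_modes, sym_centroid)
```

The flat-topped vector has three modes but a single centroid, which the small right-hand
tail pulls slightly above zero.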

Note that computing the centroid (19) is the only step in the FAM inference procedure
that requires division. All other operations are inner products, pairwise minima, and
additions. This promises realization in a fuzzy optical processor. Already some form of this
FAM-inference scheme has led to digital [Togai, 1986] and analog [Yamakawa, 1987-88]
VLSI circuitry.


FAM  System  Architecture 


Figure 17.3 schematizes the architecture of the nonlinear FAM system F. Note that F
maps fuzzy sets to fuzzy sets: F(A) = B. So F is in fact a fuzzy-system transformation
F: I^n → I^p. In practice A is a bit vector with one unity value, a_i = 1, and all other
fit values zero, a_j = 0.

The output fuzzy set B is usually defuzzified with the centroid technique to produce an
exact element y_j in the output universe of discourse Y. In effect defuzzification produces
an output binary vector O, again with one element 1 and the rest 0s. At this level the FAM
system F maps sets to sets, reducing the fuzzy system F to a mapping between Boolean
cubes, F: {0, 1}^n → {0, 1}^p. In many applications we model X and Y as continuous
universes of discourse. So n and p are quite large. We shall call such systems binary
input-output FAMs.

FIGURE 17.3 FAM system architecture. The FAM system F maps fuzzy
sets in the unit cube I^n to fuzzy sets in the unit cube I^p. Binary input fuzzy
sets are often used in practice to model exact input data. In general only an
uncertainty estimate of the system state is available. So A is a proper fuzzy set.
The user can defuzzify output fuzzy set B to yield exact output data, reducing
the FAM system to a mapping between Boolean cubes.

Binary  Input-Output  FAMs:  Inverted  Pendulum  Example 

Binary input-output FAMs (BIOFAMs) are the most popular fuzzy systems for
applications. BIOFAMs map system state-variable data to control data. In the case of traffic
control, a BIOFAM maps traffic densities to green (and red) light durations.

BIOFAMs easily extend to multiple FAM rule antecedents, to mappings from product
cubes to product cubes. There has been little theoretical justification for this extension,
aside from Mamdani's [1977] original suggestion to multiply relational matrices. The
extension to multi-antecedent FAM rules is easier applied than formally explained. In the
next section we present a general explanation for dealing with multi-antecedent FAM rules.
First, though, we present the BIOFAM algorithm by illustrating it, and the FAM
construction procedure, on an archetypical control problem.

Consider an inverted pendulum. In particular, consider how to adjust a motor to
balance an inverted pendulum in two dimensions. The inverted pendulum is a classical control
problem. It admits a math-model control solution. This provides a formal benchmark for
BIOFAM pendulum controllers.

There are two state variables and one control variable. The first state variable is the
angle θ that the pendulum shaft makes with the vertical. Zero angle corresponds to the
vertical position. Positive angles are to the right of the vertical, negative angles to the left.

The second state variable is the angular velocity Δθ. In practice we approximate the
instantaneous angular velocity Δθ as the difference between the present angle measurement
θ_t and the previous angle measurement θ_{t-1}:

\[ \Delta\theta_t = \theta_t - \theta_{t-1} . \]

The control variable is the motor current or angular velocity v_t. The velocity can also
be  positive  or  negative.  We  expect  that  if  the  pendulum  falls  to  the  right,  the  motor 
velocity  should  be  negative  to  compensate.  If  the  pendulum  falls  to  the  left,  the  motor 
velocity  should  be  positive.  If  the  pendulum  successfully  balances  at  the  vertical,  the  motor 
velocity  should  be  zero. 

The  real  line  R  is  the  universe  of  discourse  of  the  three  variables.  In  practice  we 
restrict each universe of discourse to a comparatively small interval, such as [-90, 90] for
the  pendulum  angle,  centered  about  zero. 

We can quantize each universe of discourse into seven overlapping fuzzy sets. We know
that the system variables can be positive, zero, or negative. We can quantize the
magnitudes of the system variables finely or coarsely. Suppose we quantize the magnitudes as
small, medium, and large. This leads to seven linguistic fuzzy set values:

    NL: Negative Large
    NM: Negative Medium
    NS: Negative Small
    ZE: Zero
    PS: Positive Small
    PM: Positive Medium
    PL: Positive Large

For example, θ is a fuzzy variable that takes NL as a fuzzy set value. Different fuzzy
quantizations of the angle universe of discourse allow the fuzzy variable θ to assume
different fuzzy set values. The expressive power of the FAM approach stems from these fuzzy-set
quantizations. In one stroke we reduce system dimensions, and we describe a nonlinear
numerical process with linguistic common-sense terms.

We are not concerned with the exact shape of the fuzzy sets defined on each of the
three universes of discourse. In practice the quantizing fuzzy sets are usually symmetric
triangles or trapezoids centered about representative values. (We can think of such sets as
fuzzy numbers.) The set ZE may be a Gaussian curve for the pendulum angle θ, a triangle
for the angular velocity Δθ, and a trapezoid for the velocity v. But all the ZE fuzzy sets
will be centered about the numerical value zero, which will have maximum membership in
the set of zero values.

How  much  should  contiguous  fuzzy  sets  overlap?  This  design  issue  depends  on  the 
problem  at  hand.  Too  much  overlap  blurs  the  distinction  between  the  fuzzy  set  values. 
Too  little  overlap  tends  to  resemble  bivalent  control,  producing  overshoot  and  undershoot. 
In  Chapter  19  we  determine  experimentally  the  following  default  heuristic  for  ideal  overlap: 
Contiguous  fuzzy  sets  in  a  library  should  overlap  approximately  25%. 
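
A library of quantizing fuzzy sets can be sketched with symmetric triangles. Everything
numeric below is an assumption made for illustration: the centers, the common half-width,
and the use of triangles for every label are not taken from the text, and the widths are
not tuned to the 25% overlap heuristic.

```python
# Sketch: seven symmetric triangular fuzzy sets on the angle interval [-90, 90].

def triangle(center, half_width):
    """Return a membership function peaking at 1 on `center` and 0 beyond `half_width`."""
    def membership(x):
        return max(0.0, 1.0 - abs(x - center) / half_width)
    return membership


LABELS = ["NL", "NM", "NS", "ZE", "PS", "PM", "PL"]
CENTERS = [-90, -60, -30, 0, 30, 60, 90]   # hypothetical representative values
HALF_WIDTH = 40                            # hypothetical; larger values mean more overlap

library = {label: triangle(center, HALF_WIDTH)
           for label, center in zip(LABELS, CENTERS)}

# Fit values of a crisp angle reading in each of the seven sets.
reading = 15
fits = {label: library[label](reading) for label in LABELS}
print(fits)
```

Because contiguous triangles overlap, a single reading such as 15 belongs to both ZE and PS
to nonzero degree, which is what lets several FAM rules fire at once.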

FAM rules are triples, such as (NM, ZE; PM). They describe how to modify the
control variable for observed values of the pendulum state variables. A FAM rule associates
a motor-velocity fuzzy set value with a pendulum-angle fuzzy set value and an
angular-velocity fuzzy set value. So we can interpret the triple (NM, ZE; PM) as the set-level
implication

    IF the pendulum angle θ is negative but medium
    AND the angular velocity Δθ is about zero,
    THEN the motor velocity should be positive but medium.


These commonsensical FAM rules are comparatively easy to articulate in natural language.
Consider a terser linguistic version of the same two-antecedent FAM rule:

    IF θ = NM AND Δθ = ZE,
    THEN v = PM.

Even  this  mild  level  of  formalism  may  inhibit  the  knowledge  acquisition  process.  On  the 
other hand, the still terser FAM triple (NM, ZE; PM) allows knowledge to be acquired
simply  by  filling  in  a  few  entries  in  a  linguistic  FAM-bank  matrix.  In  practice  this  often 
allows  a  working  system  to  be  developed  in  hours,  if  not  minutes. 

We  specify  the  pendulum  FAM  system  when  we  choose  a  FAM  bank  of  two-antecedent 
FAM rules. Perhaps the first FAM rule to choose is the steady-state FAM rule: (ZE, ZE; ZE).
The  steady-state  FAM  rule  describes  what  to  do  in  equilibrium.  For  the  inverted  pendulum 
we  should  do  nothing. 

This  is  typical  of  many  control  problems  that  require  nulling  a  scalar  error  measure. 

We  can  control  multivariable  problems  by  nulling  the  norms  of  the  system  error  vector 
and  error-velocity  vectors,  or,  better,  by  directly  nulling  the  individual  scalar  variables. 
(Chapter  19  shows  how  error  nulling  can  control  a  realtime  target  tracking  system.)  Error 
nulling  tractably  extends  the  FAM  methodology  to  nonlinear  estimation,  control,  and 
decision  problems  of  high  dimension. 




The pendulum FAM bank is a 7-by-7 matrix with linguistic fuzzy-set entries. We index
the columns by the seven fuzzy sets that quantize the angle θ universe of discourse. We
index the rows by the seven fuzzy sets that quantize the angular velocity Δθ universe of
discourse.

Each matrix entry is one of seven motor-velocity fuzzy-set values. Since a FAM rule is a
mapping or function, there is exactly one output velocity value for every pair of angle and
angular-velocity values. So the 49 entries in the FAM bank matrix represent the 49 possible
two-antecedent FAM rules. In practice most of the entries are blank. In the adaptive FAM
case discussed below, we adaptively generate the entries from process sample data.

Commonsense dictates the entries in the pendulum FAM bank matrix. Suppose the
pendulum is not changing. So Δθ = ZE. If the pendulum is to the right of vertical,
the motor velocity should be negative to compensate. The farther the pendulum is to
the right, the larger the negative motor velocity should be. The motor velocity should
be positive if the pendulum is to the left. So the fourth row of the FAM bank matrix,
which corresponds to Δθ = ZE, should be the ordinal inverse of the θ row values. This
assignment includes the steady-state FAM rule (ZE, ZE; ZE).

Now suppose the angle θ is zero but the pendulum is moving. If the angular velocity is
negative, the pendulum will overshoot to the left. So the motor velocity should be positive
to compensate. If the angular velocity is positive, the motor velocity should be negative.
The greater the angular velocity is in magnitude, the greater the motor velocity should
be in magnitude. So the fourth column of the FAM bank matrix, which corresponds to
θ = ZE, should be the ordinal inverse of the Δθ column values. This assignment also
includes the steady-state FAM rule.

Positive θ values with negative Δθ values should produce negative motor velocity values,
since the pendulum is heading toward the vertical. So (PS, NS; NS) is a candidate FAM
rule. Symmetrically, negative θ values with positive Δθ values should produce positive
motor velocity values. So (NS, PS; PS) is another candidate FAM rule.

This  gives  15  FAM  rules  altogether.  In  practice  these  rules  are  more  than  sufficient  to 
successfully  balance  an  inverted  pendulum.  Different,  and  smaller,  subsets  of  FAM  rules 
may  also  successfully  balance  the  pendulum. 


We  can  represent  the  bank  of  15  FAM  rules  as  the  7-by-7  linguistic  matrix 


[FAM-bank matrix: columns indexed by θ, rows indexed by Δθ.]
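
The 15 rules described in the preceding paragraphs can be collected into a sparse FAM bank.
The sketch below builds the bank only from those verbal descriptions (the Δθ = ZE row, the
θ = ZE column, and the two off-axis rules). The dictionary layout and function names are
assumptions made for illustration.

```python
# Sketch: the 15-rule pendulum FAM bank as a sparse mapping (theta, dtheta) -> velocity.

LABELS = ["NL", "NM", "NS", "ZE", "PS", "PM", "PL"]


def ordinal_inverse(label):
    """Mirror a label about ZE: NL <-> PL, NM <-> PM, NS <-> PS, ZE <-> ZE."""
    return LABELS[len(LABELS) - 1 - LABELS.index(label)]


def build_pendulum_fam_bank():
    """Return {(theta_label, dtheta_label): velocity_label} for the rules in the text."""
    bank = {}
    for label in LABELS:
        bank[(label, "ZE")] = ordinal_inverse(label)   # row dtheta = ZE
        bank[("ZE", label)] = ordinal_inverse(label)   # column theta = ZE
    bank[("PS", "NS")] = "NS"
    bank[("NS", "PS")] = "PS"
    return bank


def render(bank):
    """Format the bank as text: one row per dtheta label, one column per theta label."""
    lines = ["     " + " ".join(f"{t:>3}" for t in LABELS)]
    for d in LABELS:
        cells = " ".join(f"{bank.get((t, d), '.'):>3}" for t in LABELS)
        lines.append(f"{d:>3}  " + cells)
    return "\n".join(lines)


fam_bank = build_pendulum_fam_bank()
print(render(fam_bank))
```

Blank entries print as dots. Keeping the bank sparse reflects the remark that most of the
49 possible entries are left blank in practice.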


The BIOFAM system F also admits a geometric interpretation. The set of all possible
input-output pairs (θ, Δθ; F(θ, Δθ)) defines a FAM surface in the input-output product space,
in this case in R^3. We plot examples of these control surfaces in Chapters 18 and 19.

The BIOFAM inference procedure activates in parallel the antecedents of all 15 FAM
rules. The binary or pulse nature of inputs picks off single fit values from the quantizing
fuzzy sets. We can use either the correlation-minimum or correlation-product inferencing
technique. For simplicity we shall illustrate the procedure with correlation-minimum
inferencing.

Suppose the current pendulum angle θ is 15 degrees and the angular velocity Δθ is
-10. This amounts to passing two bit vectors of one 1 and all else 0 through the BIOFAM
system. What is the corresponding motor velocity value v = F(15, -10)?

Consider first how the input data pair (15, -10) activates steady-state FAM rule (ZE, ZE;
ZE). Suppose we define the antecedent and consequent fuzzy sets for ZE with the
triangular fuzzy set membership functions in Figure 17.4. Then the angle datum 15 is a zero
angle value to degree .2: m^θ_ZE(15) = .2. The angular velocity datum -10 is a zero
angular velocity value to degree .5: m^Δθ_ZE(-10) = .5.

We  combine  the  antecedent  fit  values  with  minimum  or  maximum  according  as  the 
antecedent  fuzzy  sets  are  combined  with  the  conjunctive  AND  or  the  disjunctive  OR. 
Intuitively,  it  should  be  at  least  as  difficult  to  satisfy  both  antecedent  conditions  as  to 
satisfy either one separately.

The FAM rule notation (ZE, ZE; ZE) implicitly assumes that antecedent fuzzy sets
are combined conjunctively with AND. So the data satisfy the compound antecedent of
the FAM rule (ZE, ZE; ZE) to degree

\[ \min\bigl( m^{\theta}_{ZE}(15),\; m^{\Delta\theta}_{ZE}(-10) \bigr) = \min(.2,\, .5) = .2 . \]

Clearly this methodology extends to any number of antecedent terms connected with
arbitrary logical (set-theoretical) connectives.
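
The combination of antecedent fit values can be written as a one-line helper. This is a
minimal sketch: the function name and the string-valued connective argument are assumptions,
and only AND and OR are covered.

```python
# Sketch: combine antecedent fit values with min (AND) or max (OR).

def rule_activation(fit_values, connective="AND"):
    """Degree to which the data satisfy a compound antecedent."""
    if connective == "AND":
        return min(fit_values)
    if connective == "OR":
        return max(fit_values)
    raise ValueError(f"unsupported connective: {connective}")


# Fit values from the text: the angle 15 is ZE to degree .2 and
# the angular velocity -10 is ZE to degree .5.
degree_and = rule_activation([0.2, 0.5])         # conjunctive antecedent
degree_or = rule_activation([0.2, 0.5], "OR")    # disjunctive antecedent
print(degree_and, degree_or)
```

The conjunctive degree never exceeds the disjunctive degree, in line with the intuition
that satisfying both conditions is at least as hard as satisfying either one.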

The  system  should  now  activate  the  consequent  fuzzy  set  of  zero  motor  velocity  values 
to  degree  .2.  This  is  not  the  same  as  activating  the  ZE  motor  velocity  fuzzy  set  100%  with 
probability  .2,  and  certainly  not  the  same  as  Prob{v  =  0}  =  .2.  Instead  a  deterministic 
20%  of  ZE  should  result  and,  according  to  the  additive  combination  formula  (17),  should 
be  added  to  the  final  output  fuzzy  set. 

The correlation-minimum inference procedure activates the motor velocity fuzzy set
ZE to degree .2 by taking the pairwise minimum of .2 and the ZE fuzzy set m^v_ZE:

\[ \min\bigl( m^{\theta}_{ZE}(15),\; m^{\Delta\theta}_{ZE}(-10) \bigr) \wedge m^{v}_{ZE}(v) = .2 \wedge m^{v}_{ZE}(v) \]

for all velocity values v. The correlation-product inference procedure would simply multiply
the zero motor velocity fuzzy set by .2: .2 m^v_ZE(v) for all v.

The data similarly activate the FAM rule (PS, ZE; NS) depicted in Figure 17.4. The
angle datum 15 is a small but positive angle value to degree .8. The angular velocity datum
-10 is a zero angular velocity value to degree .5. So the output motor velocity fuzzy set of
small but negative motor velocity values is scaled by .5, the lesser of the two antecedent
fit values:

\[ \min\bigl( m^{\theta}_{PS}(15),\; m^{\Delta\theta}_{ZE}(-10) \bigr) \wedge m^{v}_{NS}(v) = .5 \wedge m^{v}_{NS}(v) \]

for all velocity values v. So the data activate the FAM rule (PS, ZE; NS) to greater degree
than the steady-state FAM rule (ZE, ZE; ZE) since in this example an angle value of 15
degrees is more a small but positive angle value than a zero angle value.

The data similarly activate the other 13 FAM rules. We combine the resulting
minimum-scaled consequent fuzzy sets according to (17) by summing pointwise. We can then
compute the fuzzy centroid with equation (19), with perhaps integrals replacing the discrete
sums, to determine the specific output motor velocity v. In Chapter 19 we show that, for
symmetric fuzzy sets of quantization, the centroid can always be computed exactly with
simple discrete sums even if the fuzzy sets are continuous. In many realtime applications
we must repeat this entire FAM inference procedure hundreds, perhaps thousands, of times
per second. This requires fuzzy VLSI or optical processors.

Figure 17.4 illustrates this equal-weight additive combination procedure for just the
FAM rules (ZE, ZE; ZE) and (PS, ZE; NS). The fuzzy-centroidal motor velocity value
in this case is -3.

FIGURE 17.4 FAM correlation-minimum inference procedure. The FAM
system consists of the two two-antecedent FAM rules (PS, ZE; NS) and
(ZE, ZE; ZE). The input angle datum is 15, and is more a small but positive
angle value than a zero angle value. The input angular velocity datum
is -10, and is only a zero angular velocity value to degree .5. Antecedent fit
values are combined with minimum since the antecedent terms are combined
conjunctively with AND. The combined fit value then scales the consequent
fuzzy set with pairwise minimum. The minimum-scaled output fuzzy sets are
added pointwise. The fuzzy centroid of this output waveform is computed and
yields the system output velocity value -3.
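
The two-rule procedure of Figure 17.4 can be sketched end to end. The antecedent fit values
(.2, .8, .5) are those quoted in the text. The consequent membership functions, the velocity
grid, and all names below are assumptions, since the figure's actual sets are not given
numerically. The computed centroid therefore need not equal the figure's value of -3.

```python
# Sketch: two-rule BIOFAM inference with additive combination and centroid output.

def triangle(center, half_width):
    """Symmetric triangular membership function (assumed shape)."""
    def membership(x):
        return max(0.0, 1.0 - abs(x - center) / half_width)
    return membership


# Antecedent fit values quoted in the text for the input pair (15, -10).
FIT = {("theta", "ZE"): 0.2, ("theta", "PS"): 0.8, ("dtheta", "ZE"): 0.5}

# Hypothetical consequent sets on the motor-velocity universe of discourse.
CONSEQUENT = {"ZE": triangle(0.0, 6.0), "NS": triangle(-6.0, 6.0)}

RULES = [("ZE", "ZE", "ZE"), ("PS", "ZE", "NS")]   # (theta, dtheta; v)
V_GRID = list(range(-12, 13))                       # hypothetical discrete velocities


def infer(rules, fit, consequent, grid, product=False):
    """Sum the clipped (or, if product=True, scaled) consequent sets pointwise."""
    output = [0.0] * len(grid)
    for theta_label, dtheta_label, v_label in rules:
        degree = min(fit[("theta", theta_label)], fit[("dtheta", dtheta_label)])
        for idx, v in enumerate(grid):
            m = consequent[v_label](v)
            output[idx] += degree * m if product else min(degree, m)
    return output


def fuzzy_centroid(ys, b):
    """Centroid sum_j y_j m(y_j) / sum_j m(y_j)."""
    return sum(y * m for y, m in zip(ys, b)) / sum(b)


v_min = fuzzy_centroid(V_GRID, infer(RULES, FIT, CONSEQUENT, V_GRID))
v_prod = fuzzy_centroid(V_GRID, infer(RULES, FIT, CONSEQUENT, V_GRID, product=True))
print(v_min, v_prod)
```

Both inference variants give a negative velocity between the NS and ZE centers, pulled
toward NS because that rule fires to degree .5 while the steady-state rule fires only to
degree .2.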


Multi-Antecedent  FAM  Rules:  Decompositional  Inference 


The BIOFAM inference procedure treats antecedent fuzzy sets as if they were
propositions with fuzzy truth values. This is because fuzzy logic corresponds to 1-dimensional
fuzzy set theory and because we use binary or exact inputs. We now formally develop the
connection between BIOFAMs and the FAM theory presented earlier.

Consider the compound FAM rule “IF X is A AND Y is B, THEN Z is C,”
or (A, B; C) for short. Let the universes of discourse X, Y, and Z have dimensions n, p,
and q: X = {x_1, ..., x_n}, Y = {y_1, ..., y_p}, and Z = {z_1, ..., z_q}. We can directly
extend this framework to multiple antecedent and consequent terms.

In our notation X, Y, and Z are both universes of discourse and fuzzy variables. The
fuzzy variable X can assume the fuzzy set values A_1, A_2, ..., and similarly for the fuzzy
variables Y and Z. When controlling an inverted pendulum, the identification “X is A”
might represent the natural-language description “The pendulum angle is positive but
small.”

What is the matrix representation of the FAM rule (A, B; C)? The question is nontrivial
since A, B, and C are fuzzy subsets of different universes of discourse, points in different
unit cubes. Their dimensions and interpretations differ. Mamdani [1977] and others have
suggested representing such rules as fuzzy multidimensional relations or arrays. Then the
FAM rule (A, B; C) would be a fuzzy subset of the product space X × Y × Z. This representation
is not used in practice since only exact inputs are presented to FAM systems
and the BIOFAM procedure applies. If we presented the system with a genuine fuzzy set
input, we would no doubt preprocess the fuzzy set with a centroidal or maximum-fit-value
technique so we could still apply the BIOFAM inference procedure.

We present an alternative representation that decomposes, then recomposes, the FAM
rule (A, B; C) in accord with the FAM inference procedure. This representation allows
neural networks to adaptively estimate, store, and modify the decomposed FAM rules. The
representation requires far less storage than the multidimensional-array representation.

Let the fuzzy Hebb matrices $M_{AC}$ and $M_{BC}$ store the simple FAM associations (A, C)
and (B, C):

$$ M_{AC} = A^T \circ C , \qquad (20) $$

$$ M_{BC} = B^T \circ C . \qquad (21) $$

The fuzzy Hebb matrices $M_{AC}$ and $M_{BC}$ split the compound FAM rule (A, B; C). We can
also construct the splitting matrices with correlation-product encoding.

Let $I_X^i = (0 \ldots 0 \; 1 \; 0 \ldots 0)$ be an n-dimensional bit vector with ith element 1 and all
other elements 0. $I_X^i$ is the ith row of the n-by-n identity matrix. Similarly, $I_Y^j$ and $I_Z^k$ are
the respective jth and kth rows of the p-by-p and q-by-q identity matrices. The bit vector
$I_X^i$ represents the occurrence of the exact input $x_i$.

We will call the proposed FAM representation scheme FAM decompositional inference,
in the spirit of the max-min compositional inference scheme discussed above. FAM
decompositional inference decomposes the compound FAM rule (A, B; C) into the component
rules (A, C) and (B, C). The simpler component rules are processed in parallel.
New fuzzy set inputs A' and B' pass through the FAM matrices $M_{AC}$ and $M_{BC}$. Max-min
composition then gives the recalled fuzzy sets $C_{A'}$ and $C_{B'}$:

$$ C_{A'} = A' \circ M_{AC} , \qquad (22) $$

$$ C_{B'} = B' \circ M_{BC} . \qquad (23) $$

The trick is to recompose the fuzzy sets $C_{A'}$ and $C_{B'}$ with intersection or union according
as the antecedent terms “X is A” and “Y is B” are combined with AND or OR. The negated
antecedent term “X is NOT A” requires forming the set complement $C_{A'}^c$ for input fuzzy
set A'.

Suppose we present the new inputs A' and B' to the single-FAM-rule system F that
stores the FAM rule (A, B; C). Then the recalled output fuzzy set C' equals the intersection
of $C_{A'}$ and $C_{B'}$:

$$ F(A', B') = [A' \circ M_{AC}] \cap [B' \circ M_{BC}] \qquad (24) $$

$$ = C_{A'} \cap C_{B'} $$

$$ = C' . $$

We can then defuzzify C', if we wish, to yield the exact output $I_Z^k$.
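The recall step in (20) through (24) can be sketched numerically. The following is a minimal sketch, assuming correlation-minimum encoding and NumPy arrays for fit vectors; the function names and the example fit vectors A, B, and C are illustrative assumptions, not values from the text.

```python
import numpy as np

def hebb_min(a, c):
    """Correlation-minimum encoding: M[i, k] = min(a[i], c[k])."""
    return np.minimum.outer(a, c)

def max_min(a, m):
    """Max-min composition: (a o M)[k] = max over i of min(a[i], M[i, k])."""
    return np.minimum(a[:, None], m).max(axis=0)

def decompositional_inference(a_new, b_new, m_ac, m_bc):
    """Recall C' for an AND rule: intersect the two recalled consequents."""
    c_a = max_min(a_new, m_ac)      # C_A' as in (22)
    c_b = max_min(b_new, m_bc)      # C_B' as in (23)
    return np.minimum(c_a, c_b)     # pointwise minimum = fuzzy intersection, (24)

# Hypothetical fit vectors, chosen only for illustration.
A = np.array([0.2, 1.0, 0.4])
B = np.array([0.5, 0.9])
C = np.array([0.3, 1.0, 0.6, 0.1])

M_AC = hebb_min(A, C)
M_BC = hebb_min(B, C)
C_out = decompositional_inference(A, B, M_AC, M_BC)
```

Because B here has maximum fit value 0.9, the recalled set is C clipped at 0.9. For an OR rule the same sketch would use `np.maximum` in the last step.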


The logical connectives apply to the antecedent terms of different dimension and meaning.
Decompositional inference applies the set-theoretic analogues of the logical connectives
to subsets of Z. Of course all subsets C' of Z have the same dimension and meaning.

We now prove that decompositional inference generalizes BIOFAM inference. This generalization
is not simply formal. It opens an immediate path to adaptation with arbitrary
neural network techniques.

Suppose we present the exact inputs $x_i$ and $y_j$ to the single-FAM-rule system F that
stores (A, B; C). So we present the unit bit vectors $I_X^i$ and $I_Y^j$ to F as nonfuzzy set inputs.
Then

$$ F(x_i, y_j) = F(I_X^i, I_Y^j) = [I_X^i \circ M_{AC}] \cap [I_Y^j \circ M_{BC}] $$

$$ = a_i \wedge C \;\cap\; b_j \wedge C \qquad (25) $$

$$ = \min(a_i, b_j) \wedge C . \qquad (26) $$

(25) follows from (8). Representing C with its membership function $m_C$, (26) is equivalent
to the BIOFAM prescription

$$ \min(a_i, b_j) \wedge m_C(z) \qquad (27) $$

for all z in Z.

If we encode the simple FAM rules (A, C) and (B, C) with correlation-product encoding,
decompositional inference gives the BIOFAM version of correlation-product inference:

$$ F(I_X^i, I_Y^j) = [I_X^i \circ A^T C] \cap [I_Y^j \circ B^T C] $$

$$ = a_i C \cap b_j C \qquad (28) $$

$$ = \min(a_i, b_j)\, C \qquad (29) $$

$$ = \min(a_i, b_j)\, m_C(z) \qquad (30) $$

for all z in Z. (13) implies (28). $\min(a_i c_k, b_j c_k) = \min(a_i, b_j)\, c_k$ implies (29).

Decompositional inference allows arbitrary fuzzy sets, waveforms, or distributions A'
and B' to be applied to a FAM system. The FAM system can house an arbitrary FAM
bank of compound FAM rules. If we use the FAM system to control a process, the input
fuzzy sets A' and B' can be the output of an independent state-estimation system, such
as a Kalman filter. A' and B' might then represent probability distributions on the exact
input spaces X and Y. The filter-controller cascade is a common engineering architecture.

We can split compound consequents as desired. We can split the compound FAM rule
“IF X is A AND Y is B, THEN Z is C OR W is D,” or (A, B; C, D),
into the FAM rules (A, B; C) and (A, B; D). We can use the same split if the consequent
logical connective is AND.

We can give a propositional-calculus justification for the decompositional inference
technique. Let A, B, and C be bivalent propositions with truth values t(A), t(B), and
t(C) in {0, 1}. Then we can construct truth tables to prove the two consequent-splitting
tautologies that we use in decompositional inference:

$$ [A \to (B \text{ OR } C)] \to [(A \to B) \text{ OR } (A \to C)] , \qquad (31) $$

$$ [A \to (B \text{ AND } C)] \to [(A \to B) \text{ AND } (A \to C)] , \qquad (32) $$

where the arrow represents logical implication.

In bivalent logic, the implication A → B is false iff the antecedent A is true and the
consequent B is false. Equivalently, t(A → B) = 0 iff t(A) = 1 and t(B) = 0.
This allows a “brief” truth table to be constructed to check for validity. We choose truth
values for the terms in the consequent of the overall implication (31) or (32) to make
the consequent false. Given those restrictions, if we cannot find truth values to make the
antecedent true, the statement is a tautology. In (31), if t((A → B) OR (A → C)) = 0,
then t(A) = 1 and t(B) = t(C) = 0, since a disjunction is false iff both disjuncts are
false. This forces the antecedent A → (B OR C) to be false. So (31) is a tautology: It
is true in all cases.
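The same conclusion can be reached with a full rather than “brief” truth table. The sketch below enumerates all eight truth assignments for (31) and (32); the helper names are illustrative.

```python
from itertools import product

def implies(p, q):
    """Bivalent implication: false only when p is true and q is false."""
    return (not p) or q

def is_tautology(formula, n_vars=3):
    """Check a formula on every assignment of n_vars bivalent propositions."""
    return all(formula(*values) for values in product([False, True], repeat=n_vars))

def split_or(a, b, c):
    """Formula (31): [A -> (B OR C)] -> [(A -> B) OR (A -> C)]."""
    return implies(implies(a, b or c), implies(a, b) or implies(a, c))

def split_and(a, b, c):
    """Formula (32): [A -> (B AND C)] -> [(A -> B) AND (A -> C)]."""
    return implies(implies(a, b and c), implies(a, b) and implies(a, c))

both_hold = is_tautology(split_or) and is_tautology(split_and)
```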


We can also justify splitting the compound FAM rule “IF X is A OR Y is B,
THEN Z is C” into the disjunction (union) of the two simple FAM rules “IF X is A,
THEN Z is C” and “IF Y is B, THEN Z is C” with a propositional tautology:

$$ [(A \text{ OR } B) \to C] \to [(A \to C) \text{ OR } (B \to C)] . \qquad (33) $$

Now consider splitting the original compound FAM rule “IF X is A AND Y is B,
THEN Z is C” into the conjunction (intersection) of the two simple FAM rules “IF X
is A, THEN Z is C” and “IF Y is B, THEN Z is C.” A problem arises when
we examine the truth table of the corresponding proposition

$$ [(A \text{ AND } B) \to C] \to [(A \to C) \text{ AND } (B \to C)] . \qquad (34) $$

The problem is that (34) is not always true, and hence not a tautology. The implication
is false if A is true and B and C are false, or if A and C are false and B is true. But the
implication (34) is valid if both antecedent terms A and B are true. So if t(A) = t(B) = 1,
the compound conditional (A AND B) → C implies both A → C and B → C.

The simultaneous occurrence of the data values $x_i$ and $y_j$ satisfies this condition. Recall
that logic is 1-dimensional set theory. The condition t(A) = t(B) = 1 is given by the 1 in
$I_X^i$ and the 1 in $I_Y^j$. We can interpret the unit bit vectors $I_X^i$ and $I_Y^j$ as the (true) bivalent
propositions “X is $x_i$” and “Y is $y_j$.” Propositional logic applies coordinate-wise. A
similar argument holds for the converse of (33).

For general fuzzy set inputs A' and B' the argument still holds in the sense of continuous-valued
logic. But the truth values of the logical implications may be less than unity while
greater than zero. If A' is a null vector and B' is not, or vice versa, the implication (34)
is false coordinate-wise, at least if one coordinate of the non-null vector is unity. But in
this case the decompositional inference scheme yields an output null vector C'. In effect
the FAM system indicates the propositional falsehood.
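The bivalent claims about (33) and (34) can be checked by brute force. The sketch below lists every truth assignment that falsifies (34) and confirms that none has both A and B true; the helper names are illustrative.

```python
from itertools import product

def implies(p, q):
    """Bivalent implication: false only when p is true and q is false."""
    return (not p) or q

def formula_33(a, b, c):
    """[(A OR B) -> C] -> [(A -> C) OR (B -> C)]."""
    return implies(implies(a or b, c), implies(a, c) or implies(b, c))

def formula_34(a, b, c):
    """[(A AND B) -> C] -> [(A -> C) AND (B -> C)]."""
    return implies(implies(a and b, c), implies(a, c) and implies(b, c))

assignments = list(product([False, True], repeat=3))   # tuples (A, B, C)

tautology_33 = all(formula_33(*v) for v in assignments)
counterexamples_34 = [v for v in assignments if not formula_34(*v)]
valid_when_both_true = all(formula_34(True, True, c) for c in (False, True))
```

The two counterexamples are exactly the cases named in the text: one antecedent term true, the other false, and C false.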

Adaptive  Decompositional  Inference 

The decompositional inference scheme allows the splitting matrices $M_{AC}$ and $M_{BC}$ to
be arbitrary. Indeed it allows them to be eliminated altogether.

Let $N_X : I^n \to I^q$ be an arbitrary neural network system that maps fuzzy subsets A'
of X to fuzzy subsets C' of Z. $N_Y : I^p \to I^q$ can be a different neural network. In general
$N_X$ and $N_Y$ are time-varying.

The adaptive decompositional inference (ADI) scheme allows compound FAM rules to
be adaptively split, stored, and modified by arbitrary neural networks. The compound
FAM rule “IF X is A AND Y is B, THEN Z is C,” or (A, B; C), can be split
by $N_X$ and $N_Y$. $N_X$ can house the simple FAM association (A, C). $N_Y$ can house (B, C).
Then for arbitrary fuzzy set inputs A' and B', ADI proceeds as before for an adaptive
FAM system $F : I^n \times I^p \to I^q$ that houses the FAM rule (A, B; C) or a bank of such
FAM rules:

$$ F(A', B') = N_X(A') \cap N_Y(B') \qquad (35) $$

$$ = C_{A'} \cap C_{B'} $$

$$ = C' . $$
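Equation (35) only requires that $N_X$ and $N_Y$ map fit vectors to fit vectors in $I^q$. The following minimal sketch treats them as interchangeable callables. Both stand-in networks (a fixed max-min Hebb matrix and an untrained one-layer sigmoid map with zero weights) are hypothetical placeholders, not trained networks from the text.

```python
import numpy as np

def adi(n_x, n_y, a_new, b_new):
    """Equation (35): intersect the outputs of two arbitrary mappings into I^q."""
    return np.minimum(n_x(a_new), n_y(b_new))

def make_max_min_net(m):
    """Stand-in for N_X: max-min composition with a fixed fuzzy Hebb matrix."""
    def net(a):
        return np.minimum(a[:, None], m).max(axis=0)
    return net

def make_sigmoid_net(weights):
    """Stand-in for N_Y: a one-layer sigmoid map, so outputs lie in (0, 1)."""
    def net(b):
        return 1.0 / (1.0 + np.exp(-(b @ weights)))
    return net

# Hypothetical fit vectors and weights, chosen only for illustration.
A = np.array([0.2, 1.0, 0.4])
C = np.array([0.3, 1.0, 0.6, 0.1])
B_new = np.array([0.5, 0.9])

N_X = make_max_min_net(np.minimum.outer(A, C))
N_Y = make_sigmoid_net(np.zeros((2, 4)))    # zero weights give output 0.5 everywhere

C_out = adi(N_X, N_Y, A, B_new)
```

Either stand-in could be swapped for a trained feedforward network without changing `adi`, which is the point of the scheme.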

Any neural network technique can be used. A reasonable candidate for many unstructured
problems is the backpropagation algorithm applied to several small feedforward
multilayer networks. The primary concerns are space and training time. Several small
neural networks can often be trained in parallel faster, and more accurately, than a single
large neural network.

The  ADI  approach  illustrates  one  way  neural  algorithms  can  be  embedded  in  a  FAM 
architecture.  Below  we  discuss  another  way  that  uses  unsupervised  clustering  algorithms. 


ADAPTIVE  FAMs:  PRODUCT-SPACE  CLUSTERING 
IN  FAM  CELLS 


An adaptive FAM (AFAM) is a time-varying mapping between fuzzy cubes. In
principle the adaptive decompositional inference technique generates AFAMs. But we
shall reserve the label AFAM for systems that generate FAM rules from training data but
that do not require splitting and recombining FAM data.

We  propose  a  geometric  AFAM  procedure.  The  procedure  adaptively  clusters  training 
samples  in  the  FAM  system  input-output  product  space.  FAM  mappings  are  balls  or  clusters 
in  the  input-output  product  space.  These  clusters  are  simply  the  fuzzy  Hebb  matrices 
discussed  above.  The  procedure  “blindly”  generates  weighted  FAM  rules  from  training 
data.  Further  training  modifies  the  weighted  set  of  FAM  rules.  We  call  this  unsupervised 
procedure  product-space  clustering. 

Consider first a discrete 1-dimensional FAM system $S : I^n \to I^p$. Then a FAM rule
has the form “IF X is $A_i$, THEN Y is $B_i$” or $(A_i, B_i)$. The input-output product
space is $I^n \times I^p$.

What does the FAM rule $(A_i, B_i)$ look like in the product space $I^n \times I^p$? It looks like a
cluster of points centered at the numerical point $(A_i, B_i)$. The FAM system maps points
A near $A_i$ to points B near $B_i$. The closer A is to $A_i$, the closer the point (A, B) is to the
point $(A_i, B_i)$ in the product space $I^n \times I^p$. In this sense FAMs map balls in $I^n$ to balls
in $I^p$. The notation is ambiguous since $(A_i, B_i)$ stands for both the FAM rule mapping,
or fuzzy subset of $I^n \times I^p$, and the numerical fit-vector point in $I^n \times I^p$.

Adaptive clustering algorithms can estimate the unknown FAM rule $(A_i, B_i)$ from training
samples of the form (A, B). In general there are m unknown FAM rules $(A_1, B_1), \ldots,
(A_m, B_m)$. The number m of FAM rules is also unknown. The user may select m arbitrarily
in many applications.

Competitive adaptive vector quantization (AVQ) algorithms can adaptively estimate
both the unknown FAM rules $(A_i, B_i)$ and the unknown number m of FAM rules from
FAM system input-output data. The AVQ algorithms do not require fuzzy-set data. Scalar
BIOFAM data suffices, as we illustrate below for adaptive estimation of inverted-pendulum
control FAM rules.

Suppose the r fuzzy sets $A_1, \ldots, A_r$ quantize the input universe of discourse X. The
s fuzzy sets $B_1, \ldots, B_s$ quantize the output universe of discourse Y. In general r and s
are unrelated to each other and to the number m of FAM rules $(A_i, B_i)$. The user must
specify r and s and the shape of the fuzzy sets $A_i$ and $B_j$. In practice this is not difficult.




Quantizing  fuzzy  sets  are  usually  trapezoidal,  and  r  and  s  are  less  than  10. 

The quantizing collections $\{A_i\}$ and $\{B_j\}$ define rs FAM cells $F_{ij}$ in the input-output
product space $I^n \times I^p$. The FAM cells $F_{ij}$ overlap since contiguous quantizing fuzzy sets $A_i$
and $A_{i+1}$, and $B_j$ and $B_{j+1}$, overlap. So the FAM cell collection $\{F_{ij}\}$ does not partition
the product space $I^n \times I^p$. The union of all FAM cells also does not equal $I^n \times I^p$ since
the patches $F_{ij}$ are fuzzy subsets of $I^n \times I^p$. The union provides only a fuzzy “cover” for
$I^n \times I^p$.

The fuzzy Cartesian product $A_i \times B_j$ defines the FAM cell $F_{ij}$. $A_i \times B_j$ is just the
fuzzy outer product $A_i^T \circ B_j$ in (6) or the correlation product $A_i^T B_j$ in (12). So a FAM cell
$F_{ij}$ is simply the fuzzy correlation-minimum or correlation-product matrix $M_{ij}$: $F_{ij} = M_{ij}$.

Adaptive  FAM  Rule  Generation 

Let $m_1, \ldots, m_k$ be k quantization vectors in the input-output product space $I^n \times I^p$
or, equivalently, in $I^{n+p}$. $m_j$ is the jth column of the synaptic connection matrix M. M
has n + p rows and k columns.

Suppose, for instance, $m_j$ changes in time according to the differential competitive
learning (DCL) AVQ algorithm discussed in Chapters 6 and 9. The competitive system
samples concatenated fuzzy set samples of the form [A|B]. The augmented fuzzy set [A|B]
is a point in the unit hypercube $I^{n+p}$.

The synaptic vectors $m_j$ converge to FAM matrix centroids in $I^n \times I^p$. More generally
they estimate the density or distribution of the FAM rules in $I^n \times I^p$. The quantizing
synaptic vectors naturally weight the estimated FAM rule. The more synaptic vectors
clustered about a centroidal FAM rule, the greater its weight $w_i$ in (17).

Suppose there are 15 FAM-rule centroids in $I^n \times I^p$ and k > 15. Suppose $k_i$ synaptic
vectors $m_j$ cluster around the ith centroid. So $k_1 + \cdots + k_{15} = k$. Suppose the cluster
counts $k_i$ are ordered as

$$ k_1 \ge k_2 \ge \cdots \ge k_{15} . \qquad (36) $$

The first centroidal FAM rule is at least as frequent as the second centroidal FAM
rule, and so on. This gives the adaptive FAM-rule weighting scheme


$$ w_i = \frac{k_i}{k} . \qquad (37) $$


The FAM rule weights $w_i$ evolve in time as new augmented fuzzy sets [A|B] are sampled.
In practice we may want only the 15 most-frequent FAM rules or only the FAM rules with
at least some minimum frequency $w_{\min}$. Then (37) provides a quantitative solution.
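The weighting scheme (37) and the minimum-frequency cutoff can be written in a few lines. This is a minimal sketch; the cluster counts and the threshold value are hypothetical, and the function names are illustrative.

```python
def rule_weights(cluster_counts):
    """Equation (37): w_i = k_i / k, where k is the total number of synaptic vectors."""
    k = sum(cluster_counts)
    return [k_i / k for k_i in cluster_counts]

def select_rules(weights, w_min):
    """Indices of the FAM rules whose frequency weight is at least w_min."""
    return [i for i, w in enumerate(weights) if w >= w_min]

# Hypothetical cluster counts, already ordered as in (36).
counts = [4, 3, 2, 1]
weights = rule_weights(counts)
kept = select_rules(weights, w_min=0.2)
```

Recomputing `rule_weights` after each batch of new samples gives the time-varying weights the text describes.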

Geometrically we count the number of quantizing vectors in each FAM cell $F_{ij}$. We
can define FAM-cell boundaries in advance. High-count FAM cells outrank low-count FAM
cells. Most FAM cells contain zero or few synaptic vectors.

Product-space clustering extends to compound FAM rules and product spaces. The
FAM rule “IF X is A AND Y is B, THEN Z is C”, or (A, B; C), is a point in
$I^n \times I^p \times I^q$. The t fuzzy sets $C_1, \ldots, C_t$ quantize the new output space Z. There are
rst FAM cells $F_{ijk}$. (36) and (37) extend similarly. X, Y, and Z can be continuous. The
adaptive clustering procedure extends to any number of FAM-rule antecedent terms.


Adaptive  BIOFAM  Clustering 


BIOFAM  data  clusters  more  efficiently  than  fuzzy-set  FAM  data.  Paired  numbers  are 
easier  to  process  and  obtain  than  paired  fit  vectors.  This  allows  system  input-output  data 
to  directly  generate  FAM  systems. 

In control applications, human or automatic controllers generate streams of “well-controlled”
system input-output data. Adaptive BIOFAM clustering converts this data
to weighted FAM rules. The adaptive system transduces behavioral data to behavioral
rules. The fuzzy system learns causal patterns. It learns which control inputs cause which
control outputs. The system approximates these causal patterns when it acts as the controller.

Adaptive BIOFAMs cluster in the input-output product space X × Y. The product
space X × Y is vastly smaller than the power-set product space $I^n \times I^p$ used above. The
adaptive synaptic vectors $m_j$ are now 2-dimensional instead of (n + p)-dimensional. On
the other hand, competitive BIOFAM clustering requires many more input-output data
pairs $(x_i, y_i) \in R^2$ than augmented fuzzy-set samples $[A|B] \in I^{n+p}$.

Again our notation is ambiguous. We now use $x_i$ as the numerical sample from X
at sample time i. Earlier $x_i$ denoted the ith ordered element in the finite nonfuzzy set
$X = \{x_1, \ldots, x_n\}$. One advantage is X can be continuous, say $R^n$.

BIOFAM clustering counts synaptic quantization vectors in FAM cells. The system
samples the nonfuzzy input-output stream $(x_1, y_1), (x_2, y_2), \ldots$. Unsupervised competitive
learning distributes the k synaptic quantization vectors in X × Y. Learning
distributes them to different FAM cells $F_{ij}$. The FAM cells $F_{ij}$ overlap but are nonfuzzy
subcubes of X × Y. The BIOFAM FAM cells $F_{ij}$ cover X × Y.
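How competitive learning spreads synaptic vectors over X × Y can be sketched with the simplest winner-take-all update. This sketch assumes plain unsupervised competitive learning with a fixed learning rate, not the differential competitive learning law, and the two-cluster data stream is synthetic and hypothetical.

```python
import numpy as np

def competitive_learning(samples, k, rate=0.05, seed=0):
    """Winner-take-all AVQ: move only the closest synaptic vector toward each sample."""
    rng = np.random.default_rng(seed)
    # Initialize the k synaptic vectors at k distinct, randomly chosen samples.
    m = samples[rng.choice(len(samples), size=k, replace=False)].astype(float)
    for x in samples:
        winner = np.argmin(np.linalg.norm(m - x, axis=1))   # metrical classification
        m[winner] += rate * (x - m[winner])                  # move winner toward sample
    return m

# Synthetic input-output pairs (x_i, y_i) drawn around two hypothetical clusters.
data_rng = np.random.default_rng(1)
cluster_a = data_rng.normal(loc=[0.0, 0.0], scale=1.0, size=(100, 2))
cluster_b = data_rng.normal(loc=[10.0, -10.0], scale=1.0, size=(100, 2))
samples = np.vstack([cluster_a, cluster_b])
data_rng.shuffle(samples)

synaptic_vectors = competitive_learning(samples, k=10)
```

Each update is a convex combination of a synaptic vector and a sample, so the synaptic vectors never leave the region spanned by the data. Counting them per FAM cell then gives the cell counts.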

$F_{ij}$ contains $k_{ij}$ quantization vectors at each sample time. The cell counts $k_{ij}$ define a
frequency histogram since all $k_{ij}$ sum to k. So $w_{ij} = k_{ij}/k$ weights the FAM rule “IF X is
$A_i$, THEN Y is $B_j$.”

Suppose the pairwise-overlapping fuzzy sets NL, NM, NS, ZE, PS, PM, PL quantize
the input space X. Suppose seven similar fuzzy sets quantize the output space Y. We
can define the fuzzy sets arbitrarily. In practice they are normal and trapezoidal. (The
boundary fuzzy sets NL and PL are ramp functions.) X and Y may each be the real line.
A typical FAM rule is “IF X is NL, THEN Y is PS.”

Input datum $x_i$ is nonfuzzy. When $X = x_i$ holds, the relations X = NL, ..., X = PL
hold to different degrees. Most hold to degree zero. X = NM holds to degree $m_{NM}(x_i)$.
Input datum $x_i$ partially activates the FAM rule “IF X is NM, THEN Y is ZE” or,
equivalently, (NM; ZE). Since the FAM rules have single antecedents, $x_i$ activates the
consequent fuzzy set ZE to degree $m_{NM}(x_i)$ as well. Multi-antecedent FAM rules activate
output consequent sets according to a logic-based function of antecedent term membership
values, as discussed above on BIOFAM inference.

Suppose Figure 17.5 represents the input-output data stream $(x_1, y_1), (x_2, y_2), \ldots$ in the
planar product space X × Y:

[Figure 17.5 image not reproduced. Vertical axis: Y. Horizontal axis labels: NL, NM, NS, ZE, PS, PM, PL.]

FIGURE 17.5 Distribution of input-output data $(x_i, y_i)$ in the input-output
product space X × Y. Data clusters reflect FAM rules, such as the steady-state
FAM rule “IF X is ZE, THEN Y is ZE”.

Suppose the sample data in Figure 17.5 trains a DCL system. Suppose such competitive
learning distributes ten 2-dimensional synaptic vectors $m_1, \ldots, m_{10}$ as in Figure 17.6:

[Figure 17.6 image not reproduced. Vertical axis: Y.]

FIGURE 17.6 Distribution of ten 2-dimensional synaptic quantization vectors
$m_1, \ldots, m_{10}$ in the input-output product space X × Y. As the FAM system
samples nonfuzzy data $(x_i, y_i)$, competitive learning distributes the synaptic
vectors in X × Y. The synaptic vectors estimate the frequency distribution of
the sampled input-output data, and thus estimate FAM rules.

FAM cells do not overlap in Figures 17.5 and 17.6 for convenience’s sake. The corresponding
quantizing fuzzy sets touch but do not overlap.

Figure 17.5 reveals six sample-data clusters. The six quantization-vector clusters in
Figure 17.6 estimate the six sample-data clusters. The single synaptic vector in FAM cell
(PM; NS) indicates a smaller cluster. Since k = 10, the number of quantization vectors
in each FAM cell measures the percentage or frequency weight $w_{ij}$ of each possible FAM
rule.

In general the additive combination rule (17) does not require normalizing the quantization-vector
count $k_{ij}$. $w_{ij} = k_{ij}$ is acceptable. This holds for both maximum-membership
defuzzification (18) and fuzzy centroid defuzzification (19). These defuzzification schemes
prohibit only negative weight values.

The ten quantization vectors in Figure 17.6 estimate at most six FAM rules. From most
to least frequent or “important”, the FAM rules are (ZE; ZE), (PS; NS), (NS; PS),
(PM; NS), (PL; NL), and (NL; PL). These FAM rules suggest that fuzzy variable X is
an error variable or an error velocity variable since the steady-state FAM rule (ZE; ZE) is
most important. If we sample a system only in steady-state equilibrium, we will estimate
only the steady-state FAM rule. We can accurately estimate the FAM system’s global
behavior only if we representatively sample the system’s input-output behavior.

The “corner” FAM rules (PL; NL) and (NL; PL) may be more important than their
frequencies suggest. The boundary sets Negative Large (NL) and Positive Large (PL)
are usually defined as ramp functions, as negatively and positively sloped lines. NL and
PL alone cover the important end-point regions of the universe of discourse X. They give
$m_{NL}(x) = m_{PL}(x) = 1$ only if x is at or near the end-point of X, since NL and PL are
ramp functions not trapezoids. NL and PL cover these end-point regions “briefly”. Their
corresponding FAM cells tend to be smaller than the other FAM cells. The end-point
regions must be covered in most control problems, especially error nulling problems like
stabilizing an inverted pendulum. The user can weight these FAM-cell counts more highly,
for instance $w_{ij} = c\, k_{ij}$ for scaling constant c > 0. Or the user can simply include these
end-point FAM rules in every operative FAM bank.

Most FAM cells do not generate FAM rules. More accurately, we estimate every possible
FAM rule but usually with zero or near-zero frequency weight $w_{ij}$. For large numbers of
multiple FAM-rule antecedents, system input-output data streams through comparatively
few FAM cells. Structured trajectories in X × Y are few.

A FAM-rule’s mapping structure also limits the number of estimated FAM rules. A
FAM rule maps fuzzy sets in $I^n$ or $F(2^X)$ to fuzzy sets in $I^p$ or $F(2^Y)$. A fuzzy associative
memory maps every domain fuzzy set A to a unique range fuzzy set B. Fuzzy set A cannot
map to multiple fuzzy sets B, B', B'', and so on. We write the FAM rule as (A; B) not
(A; B or B' or B'' or ...). So we estimate at most one rule per FAM-cell row in Figure
17.6.

If two FAM cells in a row are equally and highly frequent, we can pick arbitrarily either
FAM rule to include in the FAM bank. This occurs infrequently but can occur. In principle
we could estimate the FAM rule as a compound FAM rule with a disjunctive consequent.
The simplest strategy picks only the highest frequency FAM cell per row.
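The highest-frequency-per-row strategy can be sketched as a single pass over the cell counts. The counts below are hypothetical, since the text does not give the exact per-cell counts, and a tie is broken by keeping the first cell encountered, which is one arbitrary choice among several.

```python
def pick_rules(cell_counts):
    """Keep, for each antecedent fuzzy set, the consequent with the largest count."""
    best = {}
    for (antecedent, consequent), count in cell_counts.items():
        if count <= 0:
            continue    # empty FAM cells generate no rule
        if antecedent not in best or count > best[antecedent][1]:
            best[antecedent] = (consequent, count)
    return {antecedent: consequent for antecedent, (consequent, _) in best.items()}

# Hypothetical counts k_ij keyed by (antecedent, consequent) FAM cell.
cell_counts = {
    ("ZE", "ZE"): 4,
    ("PS", "NS"): 2,
    ("PS", "ZE"): 2,    # ties with (PS; NS); the first cell seen is kept
    ("NS", "PS"): 2,
    ("NM", "PL"): 0,
}
fam_bank = pick_rules(cell_counts)
```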

The user can estimate FAM rules without counting the quantization vectors in each
FAM cell. There may be too many FAM cells to search at each estimation iteration.
The user never need examine FAM cells. Instead the user checks the synaptic vector
components $m_{ij}$. The user defines in advance fuzzy-set intervals, such as $[l_{NL}, u_{NL}]$ for
NL. If $l_{NL} \le m_{ij} \le u_{NL}$, then the FAM-antecedent reads “IF X is NL.”

Suppose the input and output spaces X and Y are the same, the real interval [-35, 35].
Suppose we partition X and Y into the same seven disjoint fuzzy sets:

NL  =  [-35,  -25] 

NM  =  [-25,  -15] 

NS  =  [-15,  -5] 

ZE  =  [-5,  5] 

PS  =  [5,  15] 

PM  =  [15,  25] 

PL  =  [25,  35]  . 

Then the observed synaptic vector $m_j = [9, -10]$ increases the count of FAM cell
PS × NS and increases the weight of FAM rule “IF X is PS, THEN Y is NS.”
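The interval check can be sketched directly from the seven intervals listed above. The function names and the extra synaptic vectors are hypothetical. A value on a shared boundary such as 5 is assigned to the first matching interval, which is an arbitrary tie-break.

```python
from collections import Counter

# The seven intervals given in the text for both X and Y.
INTERVALS = {
    "NL": (-35, -25), "NM": (-25, -15), "NS": (-15, -5), "ZE": (-5, 5),
    "PS": (5, 15), "PM": (15, 25), "PL": (25, 35),
}

def label(value):
    """Name of the first interval [l, u] with l <= value <= u, or None if outside."""
    for name, (low, high) in INTERVALS.items():
        if low <= value <= high:
            return name
    return None

def fam_cell(vector):
    """FAM cell (antecedent; consequent) that a 2-dimensional synaptic vector falls in."""
    return (label(vector[0]), label(vector[1]))

def cell_weights(vectors):
    """Frequency weights w_ij = k_ij / k over the occupied FAM cells."""
    counts = Counter(fam_cell(v) for v in vectors)
    k = len(vectors)
    return {cell: n / k for cell, n in counts.items()}

# The vector [9, -10] is from the text; the other three are hypothetical.
vectors = [(9, -10), (0, 0), (1, -2), (-12, 8)]
weights = cell_weights(vectors)
```

Only the vector components are inspected, so no search over the 49 FAM cells is needed.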

This amounts to nearest-neighbor classification of synaptic quantization vectors. We
assign quantization vector $m_k$ to FAM cell $F_{ij}$ iff $m_k$ is closer to the centroid of $F_{ij}$ than
to all other FAM-cell centroids. We break ties arbitrarily. Centroid classification allows
the FAM cells to overlap.


Adaptive BIOFAM Example: Inverted Pendulum


We used DCL to train an AFAM to control the inverted pendulum discussed above.
We used the accompanying C-software to generate 1,000 pendulum-trajectory data vectors. These
product-space training vectors (θ, Δθ, v) were points in $R^3$. Pendulum angle θ data
ranged between -90 and 90. Pendulum angular velocity Δθ data ranged from -150 to
150.

We defined FAM cells by uniformly partitioning the effective product space. Fuzzy
variables could assume only the five fuzzy set values NM, NS, ZE, PS, and PM. So
there were 125 possible FAM rules. For instance, the steady-state FAM rule took the form
(ZE, ZE; ZE) or, more completely, “IF θ = ZE AND Δθ = ZE, THEN v = ZE.”

A BIOFAM controlled the inverted pendulum. The BIOFAM restored the pendulum
to equilibrium as we knocked it over to the right and to the left. (Function keys F9 and
F10 knock the pendulum over to the left and to the right. Input-output sample data
reads automatically to a training data file.) Eleven FAM rules described the BIOFAM
controller. Figure 17.7 displays this FAM bank. Observe that the zero (ZE) row and
column are ordinal inverses of the respective row and column indices.

FIGURE 17.7 Inverted-pendulum FAM bank used in simulation. This
BIOFAM generated 1,000 sample vectors of the form (θ, Δθ, v).

We trained 125 3-dimensional synaptic quantization vectors with differential competitive
learning, as discussed in Chapters 4, 6, and 9. In principle the 125 synaptic vectors
could describe a uniform distribution of product-space trajectory data. Then the 125
FAM cells would each contain one synaptic vector. Alternatively, if we used a vertically
stabilized pendulum to generate the 1,000 training vectors, all 125 synaptic vectors would
concentrate in the (ZE, ZE; ZE) FAM cell. This would still be true if we only mildly
perturbed the pendulum from vertical equilibrium.

DCL distributed the 125 synaptic vectors to 13 FAM cells. So we estimated 13 FAM
rules. Some FAM cells contained more synaptic vectors than others. Figure 17.8 displays
the synaptic-vector histogram after the DCL system sampled the 1,000 training vectors. Actually Figure
17.8 displays a truncated histogram. The horizontal axis should list all 125 FAM cells,
all 125 FAM-rule weights $w_k$ in (17). The missing 112 entries have zero synaptic-vector
frequency.

Figure  17.8  gives  a  snapshot  of  the  adaptive  process.  In  practice,  and  in  principle, 
successive  data  gradually  modify  the  histogram.  “Good”  training  samples  should  include 
a  significant  number  of  equilibrium  samples.  In  Figure  17.8  the  steady-state  FAM  cell 
(ZE,  ZE;  ZE)  is  clearly  the  most  frequent. 


FIGURE 17.8 Synaptic-vector histogram. Differential competitive learning
allocated 125 3-dimensional synaptic vectors to the 125 FAM cells. Here
the adaptive system has sampled 1,000 representative pendulum-control data.
DCL allocates the synaptic vectors to only 13 FAM cells. The steady-state
FAM cell (ZE, ZE; ZE) is most frequent.

Figure 17.9 displays the DCL-estimated FAM bank. The product-space clustering
method rapidly recovered the 11 original FAM rules. It also estimated the two additional
FAM rules (PS, NM; ZE) and (NS, PM; ZE), which did not affect the BIOFAM
system’s performance. The estimated FAM bank defined a BIOFAM, with all 13 FAM-rule
weights $w_k$ set equal to unity, that controlled the pendulum as well as the original
BIOFAM did.
FIGURE 17.9 DCL-estimated FAM bank. Product-space clustering recovered
the original 11 FAM rules and estimated two new FAM rules. The new
and original BIOFAM systems controlled the inverted pendulum equally well.

In  nonrealtime  applications  we  can  in  principle  omit  the  adaptive  step  altogether.  We 
can  directly  compute  the  FAM-cell  histogram  if  we  exhaustively  count  all  sampled  data. 
Then  the  (growing)  number  of  synaptic  vectors  equals  the  number  of  training  samples.  This 
procedure  equally  weights  all  samples,  and  so  tends  not  to  “track”  an  evolving  process. 
Competitive  learning  weights  more  recent  samples  more  heavily.  Competitive  learning’s 
metrical-classification  step  also  helps  filter  noise  from  the  stream  of  sample  data. 
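
The direct-counting alternative described above can be sketched in a few lines. This is only an illustration under stated assumptions: the five fuzzy-set labels come from the pendulum example, but the normalized centroids, the nearest-centroid cell assignment, and the helper names `nearest_label` and `cell_histogram` are hypothetical and do not come from the accompanying software.

```python
from collections import Counter

# Five fuzzy-set values per variable give 5 x 5 x 5 = 125 FAM cells.
LABELS = ["NM", "NS", "ZE", "PS", "PM"]
# Assumed centroids on a normalized range [-1, 1] (illustrative only).
CENTROIDS = [-1.0, -0.5, 0.0, 0.5, 1.0]


def nearest_label(value):
    """Return the label whose assumed centroid lies closest to value."""
    index = min(range(len(CENTROIDS)), key=lambda i: abs(value - CENTROIDS[i]))
    return LABELS[index]


def cell_histogram(samples):
    """Count how many (x, y, z) samples fall in each FAM cell."""
    return Counter(tuple(nearest_label(v) for v in sample) for sample in samples)


# Three made-up samples: two near equilibrium, one away from it.
samples = [(0.02, -0.01, 0.0), (0.1, 0.05, -0.03), (0.45, -0.9, 0.1)]
histogram = cell_histogram(samples)
```

Every sample carries the same weight in this count, which is why the procedure does not track an evolving process the way competitive learning does.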
PROBLEMS


1. Use correlation-minimum encoding to construct the FAM matrix M from the fit-vector
pair (A, B) if A = (.6 1 .2 .9) and B = (.8 .3 1). Is (A, B) a bidirectional
fixed point? Pass A' = (.2 .9 .3 .2) through M and B' = (.9 .5 1) through $M^T$.
Do the recalled fuzzy sets differ from B and A?

2. Repeat Problem 1 using correlation-product encoding.


3.  Compute  the  fuzzy  entropy  E(M)  of  M  in  Problems  1  and  2. 

4. If $M = A^T \circ B$ in Problem 1, find a different FAM matrix M' with greater fuzzy
entropy, E(M') > E(M), but that still gives perfect recall: $A \circ M' = B$.
Find the maximum entropy fuzzy associative memory (MEFAM) matrix M* such
that $A \circ M^* = B$.

5. Prove: If $M = A^T \circ B$ or $M = A^T B$, $A \circ M = B$, and $A \subset A'$, then
$A' \circ M = B$.


6. Prove: $\max_{1 \le k \le m} \min(a_k, b_k) \le \min\left( \max_{1 \le k \le m} a_k , \ \max_{1 \le k \le m} b_k \right)$.


7. Use truth tables to prove the two-valued propositional tautologies:

(a) [A → (B OR C)] → [(A → B) OR (A → C)],

(b) [A → (B AND C)] → [(A → B) AND (A → C)],

(c) [(A OR B) → C] → [(A → C) OR (B → C)],

(d) [(A → C) AND (B → C)] → [(A AND B) → C].


Is  the  converse  of  (c)  a  tautology?  Explain  whether  this  affects  BIOFAM  inference. 


8. BIOFAM inference. Suppose the input spaces X and Y are both [-10, 10], and the
output space Z is [-100, 100]. Define five trapezoidal fuzzy sets (NL, NS, ZE, PS, PL)
on X, Y, and Z. Suppose the underlying (unknown) system transfer function is
$z = x^2 - y^2$. State at least five FAM rules that accurately describe the system’s
behavior. Use $z = x^2 - y^2$ to generate streams of sample data. Use BIOFAM inference
and fuzzy-centroid defuzzification to map input pairs (x, y) to output data z.
Plot the BIOFAM outputs and the desired outputs z. What is the arithmetic average
of the squared errors $(F(x, y) - x^2 + y^2)^2$? Divide the product space X × Y × Z
into 125 overlapping FAM cells. Estimate FAM rules from clustered system data
(x, y, z). Use these FAM rules to control the system. Evaluate the performance.


Software  Problems 


The  following  problems  use  the  accompanying  FAM  software  for  controlling  an  inverted 
pendulum. 

1. Explain why the pendulum stabilizes in the diagonal position if the pendulum bob
mass increases to maximum and the motor current decreases slightly. The pendulum
stabilizes in the vertical position if you remove which FAM rules?

2. Oscillation results if you remove which FAM rules? The pendulum sticks in a horizontal
equilibrium if you remove which FAM rules?
Department of Electrical Engineering

ABSTRACT

We  compared  fuzzy  and  Kalman-filter  control  systems  for  realtime  target  tracking. 
Both  systems  performed  well,  but  in  the  presence  of  mild  process  (unmodeled  effects)  noise 
the  fuzzy  system  exhibited  finer  control.  We  tested  the  robustness  of  the  fuzzy  controller 
by  removing  random  subsets  of  fuzzy  associations  or  “rules”  and  by  adding  destructive  or 
“sabotage” fuzzy rules to the fuzzy system. We tested the robustness of the Kalman tracking
system by increasing the variance of the unmodeled-effects noise process. The fuzzy
controller performed well until we removed over 50% of the fuzzy rules. The Kalman controller’s
performance quickly degraded as the unmodeled-effects variance increased. We
used  unsupervised  neural-network  learning  to  adaptively  generate  the  fuzzy  controller’s 
fuzzy-associative-memory  structure.  The  fuzzy  systems  did  not  require  a  mathematical 
model  of  how  system  outputs  depended  on  inputs. 
Fuzzy and Math-Model Controllers


Fuzzy  controllers  differ  from  classical  math-model  controllers.  Fuzzy  controllers  do 
not  require  a  mathematical  model  of  how  control  outputs  functionally  depend  on  control 
inputs.  Fuzzy  controllers  also  differ  in  the  type  of  uncertainty  they  represent  and  how  they 
represent  it.  The  fuzzy  approach  represents  ambiguous  or  fuzzy  system  behavior  as  partial 
implications or approximate “rules of thumb,” that is, as fuzzy associations $(A_i, B_i)$.

Fuzzy controllers are fuzzy systems. A finite fuzzy set A is a point [Kosko, 1987] in
a unit hypercube $I^n = [0, 1]^n$. A fuzzy system $F : I^n \to I^p$ is a mapping between
unit hypercubes. $I^n$ contains all fuzzy subsets of the domain space $X = \{x_1, \ldots, x_n\}$.
$I^n$ is the fuzzy power set $F(2^X)$ of X. $I^p$ contains all the fuzzy subsets of the range
space $Y = \{y_1, \ldots, y_p\}$. Element $x_i \in X$ belongs to fuzzy set A to degree $m_A(x_i)$. The $2^n$
nonfuzzy subsets of X correspond to the $2^n$ corners of the fuzzy cube $I^n$. The fuzzy system
F maps fuzzy subsets of X to fuzzy subsets of Y. In general, X and Y are continuous, not
discrete, sets.

Math-model controllers usually represent system uncertainty with probability distributions.
Probability models describe system behavior with first-order and second-order
statistics, that is, with conditional means and covariances. They usually describe unmodeled
effects and measurement imperfections with additive “noise” processes.

Mathematical  models  of  the  system  state  and  measurement  processes  facilitate  a  mean- 
squared-error  analysis  of  system  behavior.  In  general  we  cannot  accurately  articulate  such 
mathematical models. This greatly restricts the range of real-world applications. In practice
we often use linear or quasi-linear (Markov) mathematical models.

Mathematical state and measurement models also make it difficult to add non-mathematical
knowledge to the system. Experts may articulate such knowledge, or neural networks
may adaptively infer it from sample data. In practice, once we have articulated the math
model, we use human expertise only to estimate the initial state and covariance conditions.

Fuzzy controllers consist of a bank of fuzzy associative memory (FAM) “rules” or
associations $(A_i, B_i)$ operating in parallel, and operating to different degrees. Each FAM
rule is a set-level implication. It represents ambiguous expert knowledge or learned input-output
transformations. A FAM rule can also summarize the behavior of a specific mathematical
model. The system nonlinearly transforms exact or fuzzy state inputs to a fuzzy
set output. This output fuzzy set is usually “defuzzified” with a centroid operation to
generate an exact numerical output. In principle the system can use the entire fuzzy distribution
as the output. We can easily construct, process, and modify the FAM bank of
FAM rules in software or in digital VLSI circuitry.

Fuzzy  controllers  require  that  we  articulate  or  estimate  the  FAM  rules.  The  fuzzy-set 
framework  provides  more  expressiveness  than,  say,  traditional  expert-system  approaches, 
which encode bivalent propositional associations. But the fuzzy framework does not eliminate
the burden of knowledge acquisition. We can use neural network systems to estimate
the  FAM  rules.  But  neural  systems  also  require  an  accurate  (statistically  representative) 
set  of  articulated  input-output  numerical  samples.  Below  we  use  unsupervised  competitive 
learning  to  adaptively  generate  target-tracking  FAM  rules. 

Experts can hedge their system descriptions with fuzzy concepts. Although fuzzy controllers
are numerical systems, experts can contribute their knowledge in natural language.
This  is  especially  important  in  complex  problem  domains,  such  as  economics,  medicine, 
and  history,  where  we  may  not  know  how  to  mathematically  model  system  behavior. 

Below  we  compare  a  fuzzy  controller  with  a  Kalman-filter  controller  for  realtime  target 
tracking.  This  problem  admits  a  simple  and  reasonably  accurate  mathematical  description 
of  its  state  and  measurement  processes.  We  chose  the  Kalman  filter  as  a  benchmark  because 
of  its  many  optimal  linear-systems  properties.  We  wanted  to  see  whether  this  “optimal” 
controller  remains  optimal  when  compared  with  a  computationally  lighter  fuzzy  controller 
in  different  uncertainty  environments. 

We  indirectly  compared  the  sensitivity  of  the  two  controllers  by  varying  their  system 
uncertainties. We randomly removed FAM rules from the fuzzy controller. We also added
“sabotage” FAM rules to the controller. Both techniques modeled less-structured control
environments. For the Kalman filter, we varied the noise variance of the unmodeled-effects
noise process.

Both systems performed well for mildly uncertain target environments. They degraded
differently as the system uncertainty increased. The fuzzy controller’s performance degraded
when we removed more than half the FAM rules. The Kalman-filter controller’s
performance quickly degraded when the additive state noise process increased in variance.


Realtime  Target  Tracking 


A  target  tracking  system  maps  azimuth-elevation  inputs  to  motor  control  outputs.  The 
nominal  target  moves  through  azimuth-elevation  space.  Two  motors  adjust  the  position 
of  a  platform  to  continuously  point  at  the  target. 

The  platform  can  be  any  directional  device  that  accurately  points  at  the  target.  The 
device  may  be  a  laser,  video  camera,  or  high-gain  antenna.  We  assume  we  have  available 
a  radar  or  other  device  that  can  detect  the  direction  from  the  platform  to  the  target. 

The radar sends azimuth and elevation coordinates to the tracking system at the end
of each time interval. We calculate the current error $e_k$ in platform position and change in
error $\dot{e}_k$. Then a fuzzy or Kalman-filter controller determines the control outputs for the
motors, one each for azimuth and elevation. The control outputs reposition the platform.

We  can  independently  control  movement  along  azimuth  and  elevation  if  we  apply  the 
same  algorithm  twice.  This  reduces  the  problem  to  matching  the  target’s  position  and 
velocity  in  only  one  dimension. 

Figure 1 shows a block diagram of the target tracking system. The controller’s output
$v_k$ gives the estimated change in angle required during the next time interval. In principle
a hardware system must transduce the angular velocity $v_k$ into a voltage or current.


FIGURE  1  Target  tracking  system. 


FUZZY  CONTROLLER 


We restrict the output angular velocity $v_k$ of the fuzzy controller to the interval [-6, 6].
So we must insert a gain element before the voltage transduction. This gain must equal
one-sixth the maximum angle through which the platform can turn in one time interval.
Similarly, the position error $e_k$ must be scaled so that 6 equals the maximum error. The
product of this scale factor and the output gain provides a design parameter: the “gain”
of the fuzzy controller.

The fuzzy controller uses heuristic control set-level “rules” or fuzzy associative memory
(FAM) associations based on quantized values of $e_k$, $\dot{e}_k$, and $v_{k-1}$. We define seven fuzzy
levels by the following library of fuzzy-set values of the fuzzy variables $e_k$, $\dot{e}_k$, and $v_{k-1}$:


LN : Large Negative
MN : Medium Negative
SN : Small Negative
ZE : Zero
SP : Small Positive
MP : Medium Positive
LP : Large Positive

We do not quantize inputs in the classical sense that we assign each input to exactly
one output level. Instead, each linguistic value is a fuzzy set that overlaps with
adjacent fuzzy sets. The fuzzy controller uses trapezoidal fuzzy-set values, as Figure 2
shows. The lengths of the upper and lower bases provide design parameters that we must
calibrate for satisfactory performance. A good rule of thumb is that adjacent fuzzy-set values
should overlap approximately 25 percent. Below we discuss examples of calibrated and
uncalibrated systems. The fuzzy controller attained its best performance with upper and
lower bases of 1.2 and 3.9, which gives 26.2% overlap. Different target scenarios may require more or
less overlap.


FIGURE 2 Library of overlapping fuzzy-set values defined on a universe
of discourse.


We  assign  each  system  input  to  a  fit  vector  of  length  7,  where  the  ith  fit,  or  fuzzy  unit 
[Kosko,  1986],  equals  the  value  of  the  ith  fuzzy  set  at  the  input  value.  In  other  words, 
the  ith  fit  measures  the  degree  to  which  the  input  belongs  to  the  ith  fuzzy-set  value.  For 
instance,  we  apply  the  input  values  1,  —4,  and  3.8  to  the  seven  fuzzy  sets  in  the  library  to 
obtain  the  fit  vectors 


(0  0  0  .7  .7  0  0)  , 

(0  1  0  0  0  0  0), 

(0  0  0  0  .1  1  0)  . 

We determine these fit values by convolving a Dirac delta function centered at the
input value with each of the 7 fuzzy sets:

$$ m_{SP}(3.8) = \delta(y - 3.8) * m_{SP}(y) = .1 . \tag{1} $$

If we use a discretized universe of discourse, then we use a Kronecker delta function instead.
Equivalently, for the discrete case n-dimensional universe of discourse $X = \{x_1, \ldots, x_n\}$,
a control input corresponds to a bit (binary unit) vector B of length n. A single 1
element in the ith slot represents the “crisp” input value $x_i$. Similarly, we represent the
kth library fuzzy set by an n-dimensional fit vector $A_k$ that contains samples of the fuzzy
set at the n discrete points within the universe of discourse X. The degree to which the
crisp input $x_i$ activates each fuzzy set equals the inner product $B \cdot A_k$ of the bit vector B
and the corresponding fit vector $A_k$.
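
The fit-vector computation can be sketched as follows. The sketch assumes seven identical trapezoids centered at -6, -4, ..., 6 with the calibrated upper and lower bases 1.2 and 3.9 quoted earlier; the function names `trapezoid` and `fit_vector` are illustrative, and the end sets LN and LP are treated like the interior sets, which is an assumption.

```python
# Assumed centers of LN, MN, SN, ZE, SP, MP, LP on the universe [-6, 6].
CENTERS = [-6, -4, -2, 0, 2, 4, 6]
UPPER_BASE = 1.2   # width of the flat top of each trapezoid
LOWER_BASE = 3.9   # width of the support of each trapezoid


def trapezoid(distance, upper=UPPER_BASE, lower=LOWER_BASE):
    """Membership degree at a given signed distance from the set's center."""
    d = abs(distance)
    if d <= upper / 2:
        return 1.0
    if d >= lower / 2:
        return 0.0
    # Linear slope between the edge of the top and the edge of the support.
    return (lower / 2 - d) / (lower / 2 - upper / 2)


def fit_vector(x):
    """Degree to which the crisp input x belongs to each of the 7 fuzzy sets."""
    return [trapezoid(x - c) for c in CENTERS]


rounded = [[round(v, 1) for v in fit_vector(x)] for x in (1, -4, 3.8)]
```

Under these assumptions the rounded values for the inputs 1, -4, and 3.8 agree with the three fit vectors listed in the text.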

We formulate control FAM rules by associating output fuzzy sets with input fuzzy sets.
The antecedent of each FAM rule conjoins $e_k$, $\dot{e}_k$, and $v_{k-1}$ fuzzy-set values. For example,

IF $e_k$ = MP AND $\dot{e}_k$ = SN AND $v_{k-1}$ = ZE, THEN $v_k$ = SP.

We abbreviate this as (MP, SN, ZE; SP).

The scalar activation value $w_i$ of the ith FAM rule’s consequent equals the minimum
of the three antecedent conjuncts’ values. If alternatively we combine the antecedents
disjunctively with OR, the activation degree of the consequent would equal the maximum
of the three antecedent disjuncts’ values. In the following example, $m_A(e_k)$ denotes the
degree to which $e_k$ belongs to the fuzzy set A:


                       LN   MN   SN   ZE   SP   MP   LP
$e_k$ = 2.6        →  (0    0    0    0    1    .4   0)
$\dot{e}_k$ = -2.0 →  (0    0    1    0    0    0    0)
$v_{k-1}$ = 1.8    →  (0    0    0    .1   1    0    0)

$m_{MP}(e_k)$ = .4
$m_{SN}(\dot{e}_k)$ = 1
$m_{ZE}(v_{k-1})$ = .1

$w_i$ = min(.4, 1, .1) = .1 .

So the system activates the consequent fuzzy set SP to degree $w_i = .1$.

The output fuzzy set’s shape depends on the FAM-rule encoding scheme used. With
correlation-minimum encoding, we clip the consequent fuzzy set $L_i$ in the library of fuzzy-set
values to degree $w_i$ with pointwise minimum:

$$ m_{O_i}(y) = \min(w_i, m_{L_i}(y)) . \tag{2} $$

With correlation-product encoding, we multiply $L_i$ by $w_i$:

$$ m_{O_i}(y) = w_i \, m_{L_i}(y) , \tag{3} $$

or equivalently,

$$ O_i = w_i L_i . \tag{4} $$

Figure 3 illustrates how both inference procedures transform $L_i$ to scaled output $O_i$. For
the example above, correlation-product inference gives output fuzzy set $O_i = .1 \, SP$,
where $L_i = SP$ denotes the fuzzy set of small but positive angular velocity values.


FIGURE 3 FAM inference procedure depends on FAM rule encoding procedure:
(a) correlation-minimum encoding, (b) correlation-product encoding.
Each panel maps a consequent set $L_i$ to an output set $O_i$.
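
The two encodings in (2) and (3) differ only in how the activation degree reshapes the consequent set. A minimal sketch, assuming a trapezoidal SP set centered at 2 with bases 1.2 and 3.9 (the helper names `m_sp`, `clip`, and `scale` are illustrative):

```python
def m_sp(y):
    """Assumed trapezoidal SP set: center 2, upper base 1.2, lower base 3.9."""
    d = abs(y - 2.0)
    if d <= 0.6:
        return 1.0
    if d >= 1.95:
        return 0.0
    return (1.95 - d) / 1.35


def clip(w, m):
    """Correlation-minimum encoding, equation (2): clip the set at height w."""
    return lambda y: min(w, m(y))


def scale(w, m):
    """Correlation-product encoding, equation (3): scale the set by w."""
    return lambda y: w * m(y)


w = 0.1                      # activation degree from the worked example
clipped = clip(w, m_sp)
scaled = scale(w, m_sp)
```

Clipping flattens the top of the trapezoid and widens the plateau, while scaling keeps the original shape. At y = 1, on the slope of SP, the clipped set still has height .1 but the scaled set has dropped to about .07.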


The fuzzy system activates each FAM rule consequent set to a different degree. For the
ith FAM rule this yields the output fuzzy set $O_i$. The system then sums the $O_i$ to form
the combined output fuzzy set O:

$$ O = \sum_{i=1}^{N} O_i , \tag{5} $$

or equivalently,

$$ m_O(y) = \sum_{i=1}^{N} m_{O_i}(y) . \tag{6} $$

The control output equals the fuzzy centroid of O:

$$ v_k = \frac{\int y \, m_O(y) \, dy}{\int m_O(y) \, dy} , \tag{7} $$

where the limits of integration correspond to the entire universe of discourse Y of angular
velocity values. Figure 4 shows an example of correlation-product inference for two FAM
rules followed by centroid defuzzification of the combined output fuzzy set.


FIGURE 4 Correlation-product inferences followed by centroid defuzzification.
FAM rule antecedents combined with AND use the minimum fit value
to activate consequents. Those combined with OR use the maximum fit value.
Panel rules: “If $e_k$ = SP and $\dot{e}_k$ = ZE and $v_{k-1}$ ..., then $v_k$ = SP” and
“If $e_k$ = ZE and $\dot{e}_k$ = SP and $v_{k-1}$ ..., then $v_k$ = ZE.”

To reduce computations, we can discretize the output universe of discourse Y to p values,
$Y = \{y_1, \ldots, y_p\}$, which gives the discrete fuzzy centroid

$$ v_k = \frac{\sum_{j=1}^{p} y_j \, m_O(y_j)}{\sum_{j=1}^{p} m_O(y_j)} . \tag{8} $$


Fuzzy  Centroid  Computation 


We now develop two discrete methods for computing the fuzzy centroid (7). Theorem
1 states that we can compute the global centroid $v_k$ from local FAM-rule centroids. Theorem
2 states that $v_k$ can be computed from only 7 sample points if all the fuzzy sets
are symmetric and unimodal (in the broad sense of a trapezoid peak), though otherwise
arbitrary. Both results reduce computation and favor digital implementation.


Theorem 1: If correlation-product inference determines the output fuzzy sets, then we
can compute the global centroid $v_k$ from local FAM-rule centroids:

$$ v_k = \frac{\sum_{i=1}^{N} w_i c_i I_i}{\sum_{i=1}^{N} w_i I_i} . \tag{9} $$


Proof. The consequent fuzzy set of each FAM rule equals one of the fuzzy-set values
shown in Figure 2. We assume each fuzzy set includes at least one unity value, $m_A(x) = 1$.
Define $I_i$ and $c_i$ as the respective area and centroid of the ith FAM rule’s consequent set
$L_i$:

$$ I_i = \int m_{L_i}(y) \, dy , \tag{10} $$

$$ c_i = \frac{\int y \, m_{L_i}(y) \, dy}{\int m_{L_i}(y) \, dy} = \frac{\int y \, m_{L_i}(y) \, dy}{I_i} , \tag{11} $$

substituting from (10). Hence

$$ \int y \, m_{L_i}(y) \, dy = c_i I_i . \tag{12} $$

Using (3), the result of correlation-product inference, we get

$$ \int y \, m_{O_i}(y) \, dy = \int y \, w_i \, m_{L_i}(y) \, dy $$

$$ v_k = \frac{\sum_{i=1}^{N} w_i c_i I_i}{\sum_{i=1}^{N} w_i I_i} , \tag{16} $$

which is equivalent to (9). Each summand in each summation of (16) depends on only
a single FAM rule. So we can compute the global output centroid from local FAM-rule
centroids. Q.E.D.
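
The chain from (12) to (16) can be sketched as follows, using only (3), (6), (7), (10), and (12). This is offered as one way to connect the steps, not as a reproduction of the original intermediate equations, whose numbers are not repeated here:

```latex
\begin{align*}
\int y \, m_{O_i}(y) \, dy &= w_i \int y \, m_{L_i}(y) \, dy = w_i c_i I_i , \\
\int m_{O_i}(y) \, dy &= w_i \int m_{L_i}(y) \, dy = w_i I_i , \\
v_k = \frac{\int y \, m_O(y) \, dy}{\int m_O(y) \, dy}
  &= \frac{\sum_{i=1}^{N} \int y \, m_{O_i}(y) \, dy}{\sum_{i=1}^{N} \int m_{O_i}(y) \, dy}
   = \frac{\sum_{i=1}^{N} w_i c_i I_i}{\sum_{i=1}^{N} w_i I_i} .
\end{align*}
```

The first two lines pull the constant $w_i$ out of each integral. The third line exchanges the finite sum in (6) with the integrals in (7).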


Theorem 2: If the 7 library fuzzy sets are symmetric and unimodal (in the trapezoidal
sense) and we use correlation-product inference, then we can compute the centroid $v_k$ from
only 7 samples of the combined output fuzzy set O:

$$ v_k = \frac{\sum_{j=1}^{7} m_O(y_j) \, y_j \, J_j}{\sum_{j=1}^{7} m_O(y_j) \, J_j} . \tag{17} $$

The 7 sample points are the centroids of the output fuzzy-set values.


Proof. Define $O_i$ as a fit vector of length 7, where the fit value corresponding to
the ith consequent set has the value $w_i$, and the other entries equal zero. If all the fuzzy
sets are symmetric and unimodal, then the jth fit value of $O_i$ is a sample of $m_{O_i}$ at the
centroid of the jth fuzzy set. The combined output fit vector is

$$ O = \sum_{i=1}^{N} O_i . \tag{18} $$

Since

$$ m_O(y) = \sum_{i=1}^{N} m_{O_i}(y) , $$

the jth fit value of O is a sample of $m_O$ at the centroid of the jth fuzzy set. Equivalently,
the jth fit value of O equals the sum of the output activations $w_i$ from the FAM rules with
consequent fuzzy sets equal to the jth library fuzzy-set value.

Define the reduced universe of discourse as $Y = \{y_1, \ldots, y_7\}$ such that $y_j$ equals the
centroid of the jth output fuzzy set. In vector form

$$ Y = (y_1, \ldots, y_7) = (-6, -4, -2, 0, 2, 4, 6) $$

for the library of fuzzy sets in Figure 2. Also define the diagonal matrix

$$ J = \mathrm{diag}(J_1, \ldots, J_7) , \tag{19} $$


where $J_j$ denotes the area of the jth fuzzy-set value. If the ith FAM rule’s consequent fuzzy
set equals the jth fuzzy-set value, then the jth fit value of O increases by $w_i$, $c_i = y_j$,
and $I_i = J_j$. So

$$ O J Y^T = \sum_{j=1}^{7} m_O(y_j) \, y_j \, J_j = \sum_{i=1}^{N} w_i c_i I_i . \tag{20} $$

Also,

$$ O J 1^T = \sum_{j=1}^{7} m_O(y_j) \, J_j = \sum_{i=1}^{N} w_i I_i , \tag{21} $$

where 1 = (1, ..., 1). Substituting (20) and (21) into (16) gives

$$ v_k = \frac{\sum_{j=1}^{7} m_O(y_j) \, y_j \, J_j}{\sum_{j=1}^{7} m_O(y_j) \, J_j} , \tag{22} $$

which is equivalent to (17). Therefore, (22) gives a simpler, but equivalent form of the
centroid (7) if all the fuzzy sets are symmetric and unimodal, and if we use correlation-product
inference to form the output fuzzy sets $O_i$. Q.E.D.

Consider a fuzzy controller with the fuzzy sets defined in Figure 2, and 7 FAM rules
with the following outputs:

i    $w_i$    Consequent
1    0.0      MP
2    0.2      SP
3    1.0      ZE
4    0.4      SN
5    0.1      SP
6    0.8      ZE
7    0.6      SN

Figure 5 shows the combined output fuzzy set O, with the SN, ZE, and SP components
displayed with dotted lines. Using (7) we get a velocity output of -0.452. Alternatively,
the combined output fit vector O equals (0, 0, 1.0, 1.8, 0.3, 0, 0). From (22) we get

$$ v_k = \frac{-2 \times 1 + 0 \times 1.8 + 2 \times 0.3}{1 + 1.8 + 0.3} = -0.452 . $$
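
The agreement between the 7-sample formula and the centroid of the full output set can be checked numerically. The sketch below assumes identical trapezoids (bases 1.2 and 3.9) centered at the values in Y, so that the areas $J_j$ are equal and cancel; the grid step and the function names are illustrative choices.

```python
CENTROIDS = {"LN": -6, "MN": -4, "SN": -2, "ZE": 0, "SP": 2, "MP": 4, "LP": 6}
# (activation w_i, consequent label) for the 7 FAM rules in the example.
RULES = [(0.0, "MP"), (0.2, "SP"), (1.0, "ZE"), (0.4, "SN"),
         (0.1, "SP"), (0.8, "ZE"), (0.6, "SN")]


def trapezoid(distance, upper=1.2, lower=3.9):
    """Assumed common trapezoid shape, centered at zero."""
    d = abs(distance)
    if d <= upper / 2:
        return 1.0
    if d >= lower / 2:
        return 0.0
    return (lower / 2 - d) / (lower / 2 - upper / 2)


def centroid_from_rules(rules):
    """Seven-sample centroid; equal areas J_j cancel from the ratio."""
    numerator = sum(w * CENTROIDS[label] for w, label in rules)
    denominator = sum(w for w, _ in rules)
    return numerator / denominator


def centroid_on_grid(rules, step=0.01):
    """Discrete centroid (8) of the summed correlation-product output sets."""
    count = int(round(12 / step))
    ys = [-6 + j * step for j in range(count + 1)]
    m_o = [sum(w * trapezoid(y - CENTROIDS[label]) for w, label in rules)
           for y in ys]
    return sum(y * m for y, m in zip(ys, m_o)) / sum(m_o)


v_rules = centroid_from_rules(RULES)
v_grid = centroid_on_grid(RULES)
```

Both computations give about -0.452 under these assumptions. The first uses three nonzero samples, while the second evaluates the combined set at 1,201 grid points.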
FIGURE 5 Output fuzzy set O.


Fuzzy  Controller  Implementation 


A  FAM  bank  or  “rulebase”  of  FAM  rules  defines  the  fuzzy  controller.  Each  FAM  rule 
associates  one  consequent  fuzzy  set  with  three  antecedent  fuzzy-set  conjuncts. 

Suppose the ith FAM rule is (MP, SN, ZE; SP). Suppose the inputs at time k are
$e_k = 2.6$, $\dot{e}_k = -2.0$, and $v_{k-1} = 1.8$. Then

$$ w_i = \min(m_{MP}(e_k), \, m_{SN}(\dot{e}_k), \, m_{ZE}(v_{k-1})) = \min(.4, 1, .1) = .1 . $$

If all the fuzzy sets have the same shape, then they correspond to shifted versions of a
single fuzzy set ZE:

$$ m_{SP}(y) = m_{ZE}(y - 2) . $$

Define $e^*$, $\dot{e}^*$, and $v^*$ as the centroids of the corresponding antecedent fuzzy sets in the
example above. So $e^* = 4$, $\dot{e}^* = -2$, and $v^* = 0$. Then the output activation equals

$$ w_i = \min(m_{ZE}(e_k - e^*), \, m_{ZE}(\dot{e}_k - \dot{e}^*), \, m_{ZE}(v_{k-1} - v^*)) = \min(m_{ZE}(-1.4), \, m_{ZE}(0), \, m_{ZE}(1.8)) = \min(.4, 1, .1) , $$

as computed above. Figure 6 schematizes such a FAM rule when presented with crisp
inputs.

FIGURE 6 Algorithmic structure of a FAM rule for the special case of
identically-shaped fuzzy sets and correlation-product inference.
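
The shifted-ZE form of a FAM rule reduces to one membership function and three subtractions. A minimal sketch, assuming the trapezoidal ZE set with bases 1.2 and 3.9 (the names `m_ze` and `activation` are illustrative):

```python
def m_ze(y, upper=1.2, lower=3.9):
    """Assumed trapezoidal ZE set centered at zero."""
    d = abs(y)
    if d <= upper / 2:
        return 1.0
    if d >= lower / 2:
        return 0.0
    return (lower / 2 - d) / (lower / 2 - upper / 2)


def activation(inputs, centroids):
    """AND-combined rule activation: the minimum of the shifted ZE memberships."""
    return min(m_ze(x - c) for x, c in zip(inputs, centroids))


# Rule (MP, SN, ZE; SP): antecedent centroids 4, -2, 0.
# Inputs e_k = 2.6, e_dot_k = -2.0, v_{k-1} = 1.8.
w = activation((2.6, -2.0, 1.8), (4, -2, 0))
```

With these assumed set shapes the three memberships come out near .41, 1, and .11, so the activation rounds to the value .1 used in the text.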


The output fuzzy set $O_i$ in Figure 6 equals the fuzzy set ZE scaled by $w_i$ and shifted
by $c_i$:

$$ m_{O_i}(y) = w_i \, m_{ZE}(y - c_i) . \tag{23} $$

Figure 7 illustrates $O_i$.

FIGURE 7 Trapezoidal output fuzzy set $O_i$.

The fuzzy control system activates a bank of FAM rules operated in parallel, as shown
in Figure 8. The system sums the output fuzzy sets to form the total output set O, which
the system converts to a “defuzzified” scalar output by computing its fuzzy centroid.

FIGURE 8 Fuzzy control system as a parallel FAM bank with centroidal
output.

KALMAN  FILTER  CONTROLLER 

We designed a one-dimensional Kalman filter to act as an alternative controller. The
state and measurement equations take the general form

$$ x_{k+1} = \Phi_{k+1,k} \, x_k + \Gamma_{k+1,k} \, w_k + \Psi_{k+1,k} \, u_k , $$

$$ z_k = H_k \, x_k + V_k , \tag{24} $$

where $V_k$ denotes Gaussian white noise with covariance matrix $R_k$. If $V_k$ is colored noise
or if $R_k = 0$, then the filtering-error covariance matrix $P_{k|k}$ becomes singular. The state $x_k$
and the measurements $z_k$ are jointly Gaussian. Mendel [1987] gives details of this model.



Assume the following one-dimensional model:

$$ \Phi_{k+1,k} = \Gamma_{k+1,k} = \Psi_{k+1,k} = H_k = 1 \quad \text{for all } k , $$

$$ u_k = e_k + \dot{e}_k . \tag{25} $$

Let $x_{k+1}$ denote the output velocity required at time k to exactly lock onto the target at
time k+1. So the controller output at time k equals the “predictive” estimate $\hat{x}_{k+1|k} = v_k$.
Note that

$$ e_k = x_k - \hat{x}_{k|k-1} = \tilde{x}_{k|k-1} , $$

$$ \dot{e}_k = e_k - e_{k-1} . $$

Substituting (25) into (24), we get the new state equation

$$ x_{k+1} = x_k + e_k + \dot{e}_k + w_k , \tag{26} $$

where $w_k$ denotes white noise that models target acceleration or other unmodeled effects.
The new measurement equation is

$$ z_k = x_k + V_k = \hat{x}_{k|k-1} + \tilde{x}_{k|k-1} + V_k = \hat{x}_{k|k-1} + V'_k . \tag{27} $$

Since we assume $\tilde{x}_{k|k-1}$ and $V_k$ are uncorrelated, the variance of $V'_k$ is

$$ R'_k = E[V_k'^2] = E[\tilde{x}_{k|k-1}^2] + E[V_k^2] = P_{k|k-1} + R_k . \tag{28} $$
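The general form of the recursive Kalman filter equations is

$$ \hat{x}_{k|k} = \hat{x}_{k|k-1} + K_k [ z_k - H_k \hat{x}_{k|k-1} ] , $$

$$ K_k = P_{k|k-1} H_k^T [ H_k P_{k|k-1} H_k^T + R_k ]^{-1} , $$

$$ \hat{x}_{k+1|k} = \Phi_{k+1,k} \, \hat{x}_{k|k} + \Psi_{k+1,k} \, u_k , \tag{29} $$

$$ P_{k|k-1} = \Phi_{k,k-1} P_{k-1|k-1} \Phi_{k,k-1}^T + \Gamma_{k,k-1} Q_{k-1} \Gamma_{k,k-1}^T , $$

$$ P_{k|k} = [ I - K_k H_k ] P_{k|k-1} , $$

where $Q_k = \mathrm{Var}(w_k) = E[w_k w_k^T]$. Substituting (25), (27), (28) and the definition of $v_k$
into (29), we get the following one-dimensional Kalman filter:

$$ \hat{x}_{k|k} = v_{k-1} + K_k V'_k , $$

$$ K_k = P_{k|k-1} \, [R'_k]^{-1} , $$

$$ v_k = \hat{x}_{k|k} + e_k + \dot{e}_k , \tag{30} $$

$$ P_{k|k-1} = P_{k-1|k-1} + Q_{k-1} , $$

$$ P_{k|k} = [ 1 - K_k ] P_{k|k-1} . $$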

Unlike the fuzzy controller, this Kalman filter does not automatically restrict the output
$v_k$ to a usable range. We must apply a threshold immediately after the controller. To
remain consistent with the fuzzy controller, we set the following thresholds:

$|v_k| < 9$ degrees azimuth,
$|v_k| < 4.5$ degrees elevation.
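
One iteration of the one-dimensional filter, followed by the output threshold, can be sketched as below. The function name `kalman_step`, its argument names, and the use of a saturating clip as the threshold are assumptions made for illustration; the innovation is taken as $V'_k = z_k - v_{k-1}$, which follows from (27) with $\hat{x}_{k|k-1} = v_{k-1}$.

```python
def kalman_step(z, v_prev, e, e_dot, p_prev, q_prev=0.0, r=1.0, limit=9.0):
    """One step of the scalar Kalman controller with a clipped output.

    z       measurement z_k
    v_prev  previous control output v_{k-1}, used as the predicted state
    e       position error e_k
    e_dot   change in error
    p_prev  filtered error variance P_{k-1|k-1}
    """
    p_pred = p_prev + q_prev                  # P_{k|k-1}
    r_prime = p_pred + r                      # R'_k, equation (28)
    gain = p_pred / r_prime                   # K_k
    x_filt = v_prev + gain * (z - v_prev)     # filtered estimate x_{k|k}
    v = x_filt + e + e_dot                    # control output v_k
    v = max(-limit, min(limit, v))            # assumed saturating threshold
    p_filt = (1.0 - gain) * p_pred            # P_{k|k}
    return v, p_filt, gain


v, p_filt, gain = kalman_step(z=1.0, v_prev=0.0, e=0.5, e_dot=0.1, p_prev=1.0)
v_clipped, _, _ = kalman_step(z=100.0, v_prev=0.0, e=0.0, e_dot=0.0, p_prev=1.0)
```

Passing `limit=4.5` gives the elevation channel. The same function serves both axes because the text applies the same algorithm twice.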
Fuzzy and Kalman Filter Control Surfaces


Each control system maps inputs to outputs. Geometrically, these input-output transformations
define control surfaces. The control surfaces are sheets in the input space
(since the output velocity $v_k$ is a scalar). Three inputs and one output give rise to a
four-dimensional control surface, which we cannot plot. Instead, for each controller we can
plot a family of three-dimensional control surfaces indexed by constant values of the fourth
variable, the error $e_k$, say. Then each control surface corresponds to a different value of
the error $e_k$.

The fuzzy control surface characterizes the fuzzy system’s fuzzy-set value definitions
and its bank of FAM rules. Different sets of FAM rules yield different fuzzy controllers,
and hence different control surfaces. Figure 9 shows a cross section of the FAM bank when
$e_k$ = ZE. Each entry in this linguistic matrix represents one FAM rule with $e_k$ = ZE
as the first antecedent term.

FIGURE 9 $e_k$ = ZE cross section of the fuzzy control system’s FAM bank.
Each entry represents one FAM rule with $e_k$ = ZE as the first antecedent term.
The shaded FAM rule is “IF $e_k$ = ZE AND $\dot{e}_k$ = SP AND $v_{k-1}$ = SN,
THEN $v_k$ = ZE,” abbreviated as (ZE, SP, SN; ZE). Note the ordinal antisymmetry
of this FAM-bank matrix. The six other cross-section FAM-bank
matrices are similar. We can eliminate many FAM rule entries without greatly
perturbing the fuzzy controller’s behavior.


The entire FAM bank, including cross sections for $e_k$ equal to each of the seven fuzzy-set
values LN, MN, SN, ZE, SP, MP, and LP, determines how the system maps input
fuzzy sets to output fuzzy sets. The fuzzy set membership functions shown in Figure 2
determine the degree to which each crisp input value belongs to each fuzzy-set value. So
both the fuzzy-set value definitions and the FAM bank determine the defuzzified output
$v_k$ for any set of crisp input values $e_k$, $\dot{e}_k$, and $v_{k-1}$.

Figure 10 shows the control surface of the fuzzy controller for $e_k = 0$. We plotted the
control output $v_k$ against $\dot{e}_k$ and $v_{k-1}$. Since we use the same algorithm for tracking in
azimuth and elevation, the control surfaces for the two dimensions differ in scale only by
a factor of two.


FIGURE  10  Control  surface  of  the  fuzzy  controller  for  constant  error 
<  k  =  0.  We  plotted  the  control  output  vk  against  c*  and  Vk- 1  along  the 

respective  west  and  south  borders. 


The Kalman filter has a random control surface that depends on a time-varying
parameter. From (30) we see that

$$v_k = [\text{term illegible in source}] + e_k + \dot{e}_k ,$$

$$[\text{term illegible in source}] = v_{k-1} + K_k \nu_k ,$$

where $\nu_k$ denotes white noise with variance given by (28). Combining these two equations
gives the equation for the random control surface:

$$v_k = v_{k-1} + e_k + \dot{e}_k + K_k \nu_k . \qquad (31)$$

At time k the noise term $K_k \nu_k$ has variance

$$\sigma_k^2 = K_k^2 \, \mathrm{Var}(\nu_k) , \qquad (32)$$

which we can express in terms of $P_{k|k-1}$ and $R_k$ upon substituting from (30) and
from (28). Combining (31) and (32) gives a new control surface equation:

$$v_k = v_{k-1} + e_k + \dot{e}_k + \sigma_k \nu_k' , \qquad (33)$$

where $\nu_k'$ denotes unit-variance Gaussian noise. So the Kalman filter's control output
equals the sum of the three input variables plus additive Gaussian noise with time-dependent
variance $\sigma_k^2$. For constant error $e_k$, we can interpret (33) as a smooth control surface in $R^3$
defined by

$$v_k = v_{k-1} + e_k + \dot{e}_k ,$$

and perturbed at time k by Gaussian noise with variance $\sigma_k^2$.
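
As a rough illustration of (33), the following Python sketch samples one realization of the random control surface on a small grid for a fixed error value. The function name, the grid values, and the use of a single fixed standard deviation are assumptions made for this example; they are not taken from the simulations reported here.

```python
import random

def kalman_control_surface(e_k, de_values, v_prev_values, sigma_k, seed=0):
    """Sample v_k = v_{k-1} + e_k + de_k + sigma_k * noise on a grid, as in (33)."""
    rng = random.Random(seed)
    surface = []
    for v_prev in v_prev_values:
        row = []
        for de_k in de_values:
            noise = rng.gauss(0.0, 1.0)  # unit-variance Gaussian noise
            row.append(v_prev + e_k + de_k + sigma_k * noise)
        surface.append(row)
    return surface

# Hypothetical grid of values for the change in error and the previous output.
grid = [-2.0, -1.0, 0.0, 1.0, 2.0]
smooth = kalman_control_surface(0.0, grid, grid, sigma_k=0.0)   # noise-free plane
noisy = kalman_control_surface(0.0, grid, grid, sigma_k=0.22)   # perturbed plane
```

With `sigma_k=0.0` the sketch returns the smooth plane; a positive `sigma_k` perturbs each grid point independently.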

In our simulations the standard deviation $\sigma_k$ converged after only a few iterations. We
used unity initial conditions: $P_{0|0} = R_k = 1$ for all k.

Table 1 lists the convergence rates and steady-state values of $\sigma_k$ for three different
values of the variance Var(w) of the white-noise, unmodeled-effects process $w_k$. For
Var(w) = 0, $\sigma_k$ decreases rapidly at first ($\sigma_8 = .10$, $\sigma_{17} = .05$) but does not attain
its steady-state value of zero within 100 iterations.


Var(w)    Steady-state value of $\sigma_k$    Number of iterations required for convergence
 1.00              0.79                              2
 0.25              0.46                              4
 0.05              0.22                              9

TABLE 1 Convergence rates and steady-state values of $\sigma_k$ for different values
of the variance Var(w) of the white-noise, unmodeled-effects process $w_k$.


Figure 11 shows four realizations of the Kalman filter's random control surface for
$e_k = 0$, each at a time k when $\sigma_k$ has converged to its steady-state value. For each plot, we
used output thresholds and initial variances for the azimuth case: $|v_k| < 9.0$, $R_k = P_{0|0}
= 1.0$. As with the fuzzy controller, elevation control surfaces equal scaled versions of the
corresponding azimuth control surfaces.


FIGURE 11 Realizations of the Kalman filter's random control surface
with $e_k = 0$ for different values of the variance Var(w) and steady-state values
of the standard deviation $\sigma_k$: (a) Var(w) = $\sigma_k$ = 0; (b) Var(w) = .05,
$\sigma_k$ = .22; (c) Var(w) = .25, $\sigma_k$ = .40; (d) Var(w) = 1.0, $\sigma_k$ = .79.


SIMULATION  RESULTS 


Our target-tracking simulations model several real-world scenarios. Suppose we have
mounted the target-tracking system on the side of a vehicle, aircraft, or ship. The system
tracks a missile that cuts across the detection range on a straight flight path. The target
maintains a constant speed of 1,870 miles per hour and comes within 1.5 miles of the
platform at closest approach. The platform can scan from 0 to 180 degrees in azimuth at
a maximum rate of 36 degrees per second, and from 0 (vertical) to 90 degrees in elevation
at a maximum rate of 18 degrees per second. The sampling interval is 1/4 of a second.
The gain of the fuzzy controller equals 0.9. At the maximum scan rates the platform turns
at most 9 degrees in azimuth and 4.5 degrees in elevation per sampling interval, so with
this gain the maximum error considered is 10 degrees azimuth and 5 degrees elevation.
We threshold all error values above this level.

Figure  12  demonstrates  the  best  performance  of  the  fuzzy  controller  for  a  simulated 
scenario.  The  solid  lines  indicate  target  position.  The  dotted  lines  indicate  platform 
position. To achieve this performance, we calibrated the three design parameters: the upper
and lower trapezoid bases and the gain. Figures 13 and 14 show examples of uncalibrated
systems. Too much overlap causes excessive overshoot. Too little overlap causes lead or
lag for several consecutive time intervals. A gain of 0.9 suffices for most scenarios. We
can fine-tune the fuzzy control system by altering the percentage overlap between adjacent
fuzzy sets.

Figure  15  demonstrates  the  best  performance  of  the  Kalman-filter  controller  for  the 
same scenario used to test the fuzzy controller. For simplicity, $R_k = P_{0|0}$ for all values of
k. For this study we chose the values 1.0 (unit variance) for azimuth and 0.25 for elevation.
This 1/4 ratio reflects the difference in scanning range. We set $Q_k$ to 0 for optimal
performance. Figure 16 shows the Kalman-filter controller's performance when $Q_k$ = 1.0
azimuth, 0.25 elevation.


Sensitivity  Analysis 


We compared the uncertainty sensitivity of the fuzzy and Kalman-filter control systems.
Under  normal  operating  conditions,  when  the  FAM  bank  contains  all  fuzzy  control  rules, 
and  when  the  unmodeled-effects  noise  variance  Var(w)  is  small,  the  controllers  perform 
almost  identically.  Under  more  uncertain  conditions  their  performance  differs.  The  Kalman 
filter's state equation (26) contains the noise term $w_k$ whose variance we must assume.
When Var(w) increases, the state equation becomes more uncertain. The fuzzy control
FAM rules depend implicitly on this same equation, but without the noise term. Instead,
the  fuzziness  of  the  FAM  rules  accounts  for  the  system  uncertainty.  This  suggests  that  we 
can  increase  the  uncertainty  of  the  implicit  state  equation  by  omitting  randomly  selected 
FAM  rules.  Figures  17  and  18  show  the  effect  on  the  root-mean-squared  error  (RMSE)  in 
degrees  when  we  omit  FAM  rules  and  increase  Var(w).  Each  data  point  averages  ten  runs. 

The  controllers  behave  differently  as  uncertainty  increases.  The  RMSE  of  the  fuzzy 
controller  increases  little  until  we  omit  nearly  sixty  percent  of  the  FAM  rules.  The  RMSE 
of  the  Kalman  filter  increases  steeply  for  small  values  of  Var(w),  then  gradually  levels  off. 

We  also  tested  the  fuzzy  controller’s  robustness  by  “sabotaging”  the  most  vulnerable 
FAM  rule.  This  could  reflect  lack  of  accurate  expertise,  or  a  highly  unstructured  problem. 
Changing the consequent of the steady-state FAM rule (ZE, ZE, ZE; ZE) to LP gives the
following  nonsensical  FAM  rule: 

IF  the  platform  points  directly  at  the  target 

AND  both  the  target  and  the  platform  are  stationary, 

THEN  turn  in  the  positive  direction  with  maximum  velocity. 

Figure  19  shows  the  fuzzy  system’s  performance  when  this  sabotage  FAM  rule  replaces 
the  steady-state  FAM  rule.  When  the  sabotage  FAM  rule  activates,  the  system  quickly 
adjusts  to  decrease  the  error  again.  The  fuzzy  system  is  piecewise  stable. 

FIGURE 12 Best performance of the fuzzy controller: (a) azimuth position
and error, (b) elevation position and error. Fuzzy set overlap is 26.2%.

FIGURE 13 Uncalibrated fuzzy controller: (a) azimuth position and error,
(b) elevation position and error. Fuzzy set overlap equals 33.3%. Too much
overlap causes excessive overshoot.

FIGURE 14 Uncalibrated fuzzy controller: (a) azimuth position and error,
(b) elevation position and error. Fuzzy set overlap equals 12.5%. Too little
overlap causes lead or lag for several consecutive time intervals.

FIGURE 17 Root-mean-squared error of the fuzzy controller with randomly
selected FAM rules omitted.

FIGURE 18 Root-mean-squared error of the Kalman-filter controller as
Var(w) varies.

FIGURE 19 Fuzzy controller with a "sabotage" FAM rule: (a) azimuth position
and error, (b) elevation position and error. The sabotage rule (ZE, ZE, ZE; LP)
replaces the steady-state FAM rule (ZE, ZE, ZE; ZE). The system quickly
adjusts each time the sabotage rule activates.


Adaptive  FAM  (AFAM) 


We used unsupervised product-space clustering [Kosko, 1990a] to train an adaptive
FAM (AFAM) fuzzy controller. Differential competitive learning (DCL) adaptively clustered
input-output pairs. The Appendix describes product-space clustering with DCL. For
this study, there were four input neurons in $F_X$. A manually designed FAM bank and 80
random target trajectories generated 19,230 training vectors. Each product-space training
vector $(e_k, \dot{e}_k, v_{k-1}, v_k)$ defined a point in $R^4$.

Symmetry allowed us to reflect about the origin all sample vectors with negative errors
$e_k$. We then trained 3,000 synaptic quantization vectors (p = 3,000) in the positive error
half-space. For each sample vector, we defined the 10 closest synaptic vectors as "winners"
(N = 10). The matrix W of $F_Y$ within-field synaptic connection strengths had diagonal
elements $w_{ii} = 2.9$ and off-diagonal elements $w_{ij} = -0.1$. After training, we reflected the
3,000 synaptic quantization vectors about the origin to give 6,000 trained synaptic vectors.

The  product-space  FAM  cells  uniformly  partitioned  the  four-dimensional  product 
space.  Each  FAM  cell  represented  a  single  FAM  rule.  The  four  fuzzy  variables  could  assume 
only the 7 fuzzy-set values LN, MN, SN, ZE, SP, MP, and LP. So the product space
contained $7^4 = 2401$ FAM cells.

At  the  end  of  the  DCL  training  period,  we  defined  a  FAM  cell  as  occupied  only  if  it 
contained  at  least  one  synaptic  vector.  For  some  combinations  of  antecedent  fuzzy  sets, 
synaptic  vectors  occupied  more  than  one  FAM  cell  with  different  consequent  fuzzy  sets.  In 
these  cases  we  computed  the  centroid  of  the  consequent  fuzzy  sets  weighted  by  the  number 
of  synaptic  vectors  in  their  FAM  cells.  We  chose  the  consequent  fuzzy  set  as  that  output 
fuzzy-set  value  with  centroid  nearest  the  weighted  centroid  value.  We  ignored  other  FAM 
rules  with  the  same  antecedents  but  different  consequent  fuzzy  sets. 
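
The conflict-resolution step described above can be sketched in a few lines of Python. The sketch assumes each output fuzzy set is summarized by a single centroid value; the centroid numbers, the function name, and the example counts are hypothetical and serve only to show the weighting.

```python
def choose_consequent(counts, set_centroids):
    """Pick the output fuzzy set whose centroid is nearest the count-weighted centroid.

    counts: maps a consequent label to the number of synaptic vectors in its FAM cell.
    set_centroids: maps each output fuzzy-set label to its centroid (assumed values).
    """
    total = sum(counts.values())
    weighted = sum(set_centroids[label] * n for label, n in counts.items()) / total
    return min(set_centroids, key=lambda label: abs(set_centroids[label] - weighted))

# Hypothetical centroids for the seven output fuzzy sets (not from the source).
CENTROIDS = {"LN": -9.0, "MN": -6.0, "SN": -3.0, "ZE": 0.0,
             "SP": 3.0, "MP": 6.0, "LP": 9.0}

# Three occupied cells share the same antecedents but disagree on the consequent.
winner = choose_consequent({"ZE": 5, "SP": 12, "MP": 3}, CENTROIDS)
```

In this example the weighted centroid is (0·5 + 3·12 + 6·3)/20 = 2.7, so the sketch keeps the rule with consequent SP and ignores the other two.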

Figure 20(a) shows the $e_k$ = ZE cross section of the original FAM bank used to
generate the training samples. Figure 20(b) shows the same cross section of the DCL-estimated
FAM bank. Figure 21 shows the original and DCL-estimated control surfaces
for constant error $e_k = 0$.

The regions where the two control surfaces differ correspond to infrequent high-velocity
situations. So the original and DCL-estimated control surfaces yield similar results. Table
2 compares the controllers' root-mean-squared errors for 10 randomly selected target
trajectories.


FIGURE 20 Cross sections of the original and DCL-estimated FAM banks
when $e_k$ = ZE: (a) original, (b) DCL-estimated.


FIGURE 21 Control surfaces for constant error $e_k = 0$: (a) original,
(b) DCL-estimated.


Trajectory        Azimuth                Elevation
Number       Original  Estimated    Original  Estimated
   1           2.33      2.33         3.31      3.37
   2           4.14      4.14         3.03      2.89
   3           6.11      6.11         3.69      3.68
   4           3.83      3.83         3.32      3.30
   5           4.02      4.02         3.11      3.10
   6           2.84      2.84         1.20      1.21
   7           3.22      3.22         3.04      2.98
   8           0.75      0.74         2.00      2.00
   9           9.28      9.27         5.50      5.41
  10           1.81      1.81         2.29      2.29
Average        3.83      3.83         3.05      3.02

TABLE 2 Root-mean-squared errors for 10 randomly selected target
trajectories. The original and DCL-estimated FAM banks yielded similar results
since they differed only in regions corresponding to infrequent high-velocity
situations.


Conclusion 


We developed and compared a fuzzy control system and a Kalman-filter control system
for realtime target tracking. The fuzzy system represented uncertainty with continuous or
fuzzy sets, with the partial occurrence of multiple alternatives. The Kalman-filter system
represented uncertainty with the random occurrence of an exact alternative. Accordingly,
our simulations tested each system's response to a different family of uncertainty
environments, one fuzzy and the other random. In general, representative training data can
"blindly" generate the governing FAM rules.

These simulations suggest that in many cases fuzzy controllers may be a robust,
computationally effective alternative to linear Kalman-filter, indeed to nonlinear extended
Kalman-filter, approaches to realtime system control, even when we can accurately
articulate an input-output math model.


Appendix: Product-space Clustering with
Differential Competitive Learning

Adaptive Vector Quantization

Product-space clustering [Kosko, 1990a] is a form of stochastic adaptive vector
quantization. Adaptive vector quantization (AVQ) systems adaptively quantize pattern clusters
in $R^n$. Stochastic competitive-learning systems are neural AVQ systems. Neurons
compete for the activation induced by randomly sampled patterns. The corresponding fan-in
vectors adaptively quantize the pattern space $R^n$. The p synaptic vectors define the
p columns of the synaptic connection matrix M. M interconnects the n input or linear
neurons in the input neuronal field $F_X$ to the p competing nonlinear neurons in the output
field $F_Y$. Figure 22 below illustrates the neural network topology.

Learning algorithms estimate the unknown probability density function p(x), which
describes the distribution of patterns in $R^n$. More synaptic vectors arrive at more probable
regions. Where sample vectors x are dense or sparse, synaptic vectors $m_j$ should be dense
or sparse. The local count of synaptic vectors then gives a nonparametric estimate of the
volume density P(V) for volume $V \subset R^n$:

$$P(V) = \int_V p(x)\,dx \qquad (34)$$

$$\approx \frac{\text{Number of } m_j \in V}{p} . \qquad (35)$$

In the extreme case that $V = R^n$, this approximation gives P(V) = p/p = 1. For
improbable subsets V, P(V) = 0/p = 0.
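
The counting estimate in (35) is easy to illustrate. The Python sketch below assumes the trained synaptic vectors are available as a list of tuples and that the volume V is given as a membership test; the vectors and the three example regions are hypothetical.

```python
def volume_probability(synaptic_vectors, in_volume):
    """Estimate P(V) as (number of synaptic vectors in V) / p, as in (35)."""
    p = len(synaptic_vectors)
    hits = sum(1 for m in synaptic_vectors if in_volume(m))
    return hits / p

# Hypothetical trained synaptic vectors in R^2 (p = 5).
vectors = [(0.1, 0.2), (0.4, 0.1), (0.3, 0.3), (2.0, 2.1), (-1.5, 0.7)]

def unit_square(m):
    return 0.0 <= m[0] <= 1.0 and 0.0 <= m[1] <= 1.0

def whole_space(m):
    return True

def far_away(m):
    return m[0] > 100.0

p_square = volume_probability(vectors, unit_square)  # 3 of the 5 vectors fall in V
p_all = volume_probability(vectors, whole_space)     # extreme case V = R^n
p_none = volume_probability(vectors, far_away)       # improbable subset
```

The last two calls reproduce the two limiting cases mentioned above: p/p for the whole space and 0/p for a region that contains no synaptic vectors.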

Stochastic  Competitive  Learning  Algorithms 

The metaphor of competing neurons reduces to nearest-neighbor classification. The
AVQ system compares the current vector random sample x(t) in Euclidean distance to the
p columns of the synaptic connection matrix M, that is, to the p synaptic vectors
$m_1(t), \ldots, m_p(t)$.

If the jth synaptic vector $m_j(t)$ is closest to x(t), then the jth output neuron "wins" the
competition for activation at time t. In practice we sometimes define the nearest N synaptic
vectors as winners. Some scaled form of $x(t) - m_j(t)$ updates the nearest or "winning"
synaptic vectors. "Losers" remain unchanged: $m_i(t+1) = m_i(t)$. Competitive synaptic
vectors converge to pattern-class centroids exponentially fast [Kosko, 1990b].

The  following  three-step  process  describes  the  competitive  AVQ  algorithm,  where  the 
third  step  depends  on  which  learning  algorithm  updates  the  winning  synaptic  vectors. 

Competitive  AVQ  Algorithm 

1. Initialize synaptic vectors: $m_i(0) = x(i)$, $i = 1, \ldots, p$. Sample-dependent
initialization avoids many pathologies that can distort nearest-neighbor learning.

2. For random sample x(t), find the closest or "winning" synaptic vector $m_j(t)$:

$$\|m_j(t) - x(t)\| = \min_i \|m_i(t) - x(t)\| , \qquad (36)$$

where $\|x\|^2 = x_1^2 + \ldots + x_n^2$ defines the squared Euclidean vector norm of x. We
can define the N synaptic vectors closest to x as "winners."

3. Update the winning synaptic vector(s) $m_j(t)$ with an appropriate learning algorithm.
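
The three steps above can be sketched as follows. This is a minimal illustration, not the code used for the simulations: it assumes a single winner, a constant learning coefficient, and a plain unsupervised winner update in step 3 (the DCL rule described next would replace that line). The function names, the toy samples, and the parameter values are all hypothetical.

```python
import random

def squared_distance(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def competitive_avq(samples, p, n_iter, c_t=0.1, seed=0):
    """Three-step competitive AVQ with a simple unsupervised winner update."""
    rng = random.Random(seed)
    # Step 1: sample-dependent initialization m_i(0) = x(i), i = 1, ..., p.
    m = [list(samples[i]) for i in range(p)]
    for _ in range(n_iter):
        x = rng.choice(samples)
        # Step 2: the winner minimizes the Euclidean distance to x, as in (36).
        j = min(range(p), key=lambda i: squared_distance(m[i], x))
        # Step 3: move the winner toward x; losers remain unchanged.
        m[j] = [mj + c_t * (xi - mj) for mj, xi in zip(m[j], x)]
    return m

# Toy data: two well-separated clusters, interleaved so that the first two
# samples (used for initialization) come from different clusters.
samples = [(0.0, 0.0), (10.0, 10.0), (1.0, 0.0), (10.0, 9.0),
           (0.0, 1.0), (9.0, 10.0), (1.0, 1.0), (9.0, 9.0)]
trained = competitive_avq(samples, p=2, n_iter=200)
```

Because each update is a convex combination of the winner and a sample from its own cluster, each synaptic vector in this toy run stays inside the bounding box of the cluster it started in.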


Differential  Competitive  Learning  (DCL) 


Differential competitive "synapses" learn only if the competing "neuron" changes its
competitive status [Kosko, 1990c]:

$$\dot{m}_{ij} = \dot{S}_j(y_j)\,[S_i(x_i) - m_{ij}] , \qquad (37)$$

or in vector notation,

$$\dot{m}_j = \dot{S}_j(y_j)\,[S(x) - m_j] , \qquad (38)$$

where $S(x) = (S_1(x_1), \ldots, S_n(x_n))$ and $m_j = (m_{1j}, \ldots, m_{nj})$. $m_{ij}$ denotes the synaptic
value between the ith neuron in input field $F_X$ and the jth neuron in competitive field
$F_Y$. Nonnegative signal functions $S_i$ and $S_j$ transduce the real-valued activations $x_i$ and
$y_j$ into bounded monotone nondecreasing signals $S_i(x_i)$ and $S_j(y_j)$. $\dot{m}_{ij}$ and $\dot{S}_j(y_j)$ denote
the time derivatives of $m_{ij}$ and $S_j(y_j)$, the synaptic and signal velocities. $S_j(y_j)$ measures the
competitive status of the jth competing neuron in $F_Y$. Usually $S_j$ approximates a binary
threshold function. For example, $S_j$ may equal a steep binary logistic sigmoid,

$$S_j(y_j) = \frac{1}{1 + e^{-c y_j}} \qquad (39)$$

for some constant c > 0. The jth neuron wins the laterally inhibitive competition if
$S_j = 1$, loses if $S_j = 0$.

For discrete implementation, we use the DCL algorithm as a stochastic difference
equation [Kong, 1991]:

$$m_j(t+1) = m_j(t) + c_t\,\Delta S_j(y_j(t))\,[S(x(t)) - m_j(t)] \quad \text{if the jth neuron wins,} \qquad (40)$$

$$m_i(t+1) = m_i(t) \quad \text{if the ith neuron loses.} \qquad (41)$$

$\Delta S_j(y_j(t))$ denotes the time change of the jth neuron's competition signal $S_j(y_j)$ in the
competition layer $F_Y$:

$$\Delta S_j(y_j(t)) = \mathrm{sgn}[S_j(y_j(t+1)) - S_j(y_j(t))] . \qquad (42)$$

We define the signum operator sgn(x) as

$$\mathrm{sgn}(x) = \begin{cases} 1 & \text{if } x > 0 \\ 0 & \text{if } x = 0 \\ -1 & \text{if } x < 0 \end{cases} . \qquad (43)$$

$\{c_t\}$ denotes a slowly decreasing sequence of learning coefficients, such as $c_t = .1(1 -
t/2000)$ for 2000 training samples. Stochastic approximation [Huber, 1981] requires a
decreasing gain sequence $\{c_t\}$ to suppress random disturbances and to guarantee convergence
to local minima of mean-squared performance measures. The learning coefficients should
decrease slowly,

$$\sum_{t=1}^{\infty} c_t = \infty , \qquad (44)$$

but not too slowly,

$$\sum_{t=1}^{\infty} c_t^2 < \infty . \qquad (45)$$

Harmonic-series coefficients, $c_t = 1/t$, satisfy these constraints.
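
A quick numerical look at conditions (44) and (45) for the harmonic coefficients may help. The Python sketch below only prints partial sums, so it illustrates the two conditions rather than proving them; the helper names are hypothetical.

```python
import math

def partial_sums(coefficient, n_terms):
    """Return (sum of c_t, sum of c_t**2) for t = 1, ..., n_terms."""
    total = 0.0
    total_sq = 0.0
    for t in range(1, n_terms + 1):
        c_t = coefficient(t)
        total += c_t
        total_sq += c_t * c_t
    return total, total_sq

def harmonic(t):
    return 1.0 / t

for n in (10, 1000, 100000):
    s, s2 = partial_sums(harmonic, n)
    # The first sum keeps growing (roughly like log n); the second stays
    # below pi**2 / 6, the limit of the series of 1/t**2.
    print(n, round(s, 3), round(s2, 5), round(math.pi ** 2 / 6, 5))
```

The first column of sums grows without bound as n increases, while the sums of squares level off, which is the behavior that (44) and (45) ask for.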

We approximate the competitive signal difference $\Delta S_j$ as the activation difference $\Delta y_j$:

$$\Delta S_j(y_j(t)) = \mathrm{sgn}[y_j(t+1) - y_j(t)] \qquad (46)$$

$$= \Delta y_j(t) . \qquad (47)$$

Input neurons in feedforward networks usually behave linearly: $S_i(x_i) = x_i$, or $S(x(t)) = x(t)$.
Then we update the winning synaptic vector $m_j(t)$ with

$$m_j(t+1) = m_j(t) + c_t\,\Delta y_j(t)\,[x(t) - m_j(t)] . \qquad (48)$$

We update the $F_Y$ neuronal activations $y_j$ with the additive model

$$y_j(t+1) = y_j(t) + \sum_{i=1}^{n} S_i(x_i(t))\,m_{ij}(t) + \sum_{k=1}^{p} S_k(y_k(t))\,w_{kj} . \qquad (49)$$

For linear signal functions $S_i$, the first sum in (49) reduces to an inner product of sample
and synaptic vectors:

$$\sum_{i=1}^{n} x_i(t)\,m_{ij}(t) = x^T(t)\,m_j(t) . \qquad (50)$$

Then positive learning tends to occur ($\Delta m_{ij} > 0$) when x is close to the jth synaptic
vector $m_j$.

Since a binary threshold function approximates the output signal function $S_k(y_k)$, the
second sum in (49) sums over just the winning neurons: $\sum_k w_{kj}$ for all winning neurons $y_k$.

The p × p matrix W contains the $F_Y$ within-field synaptic connection strengths.
Diagonal elements $w_{ii}$ are positive, off-diagonal elements negative. Winning neurons excite
themselves and inhibit all other neurons. Figure 22 shows the connection topology of the
laterally inhibitive DCL network.
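
One discrete DCL iteration, combining (46) through (49), might look like the Python sketch below. It assumes linear input signals, a binary threshold output signal (1 for winners, 0 for losers), and winners chosen by Euclidean distance; the function names and the small two-neuron example are hypothetical and are not the code used for the simulations.

```python
def sgn(x):
    return (x > 0) - (x < 0)

def dcl_step(m, y, w, x, c_t, n_winners=1):
    """One discrete DCL update for sample x.

    m: list of p synaptic vectors (each of length n)
    y: list of p activations in the competition field
    w: p-by-p within-field matrix, c_t: learning coefficient
    """
    p = len(m)
    dist = [sum((mij - xi) ** 2 for mij, xi in zip(m[j], x)) for j in range(p)]
    winners = sorted(range(p), key=lambda j: dist[j])[:n_winners]
    # Additive activation model (49): inner product x^T m_j plus the lateral
    # terms w_kj summed over the winning neurons only.
    y_next = []
    for j in range(p):
        inner = sum(xi * mij for xi, mij in zip(x, m[j]))
        lateral = sum(w[k][j] for k in winners)
        y_next.append(y[j] + inner + lateral)
    # Update (48): only winners learn, scaled by the sign of the activation change.
    for j in winners:
        delta_y = sgn(y_next[j] - y[j])
        m[j] = [mij + c_t * delta_y * (xi - mij) for mij, xi in zip(m[j], x)]
    return m, y_next

# Hypothetical two-neuron example with the within-field weights quoted earlier.
m0 = [[1.0, 0.0], [0.0, 1.0]]
w0 = [[2.9, -0.1], [-0.1, 2.9]]
m1, y1 = dcl_step(m0, [0.0, 0.0], w0, [0.8, 0.2], c_t=0.5)
```

In this example the first neuron wins, its activation rises, and its synaptic vector moves halfway toward the sample; the losing vector is left unchanged.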


FIGURE 22 Topology of the laterally inhibitive DCL network, showing the
input field $F_X$ and the competition field $F_Y$.

Abstract

The probabilistic foundations of competitive learning systems are developed. Continuous and discrete
formulations of unsupervised, supervised, and differential competitive learning systems are studied. These systems
estimate an unknown probability density function from random pattern samples and behave as adaptive vector
quantizers. Synaptic vectors, in feedforward competitive neural networks, quantize the pattern space and converge
to pattern class centroids or local probability maxima. The stochastic calculus and a Lyapunov argument prove
that competitive synaptic vectors converge to centroids exponentially quickly. Convergence does not depend on a
specific dynamical model of how neuronal activations change.

Feedforward  Multilayer  Competitive  Learning  Systems 

Competitive learning systems are usually feedforward multilayer neural networks. Neurons compete
for the activation induced by randomly sampled pattern vectors $x \in R^n$. An unknown probability density
function p(x) characterizes the continuous distribution of random pattern vectors x. A random n-vector
$m_j$ of synaptic values fans in to each competing neuron. The synaptic vector $m_j$ is the jth column of the
synaptic connection matrix M.

Competition selects which synaptic vector $m_j$ is modified by the training sample x. In practice
competition is a metaphor for metrical pattern matching. Neuronal activation dynamics are seldom used.
The jth neuron "wins" at an iteration if the synaptic vector $m_j$ is the closest, in Euclidean distance, of
the m synaptic vectors to the random pattern x sampled at that iteration.

Some scaled form of the difference vector $x - m_j$ additively modifies the closest synaptic vector
$m_j$. Different scaling factors determine different competitive learning systems. A positive scaling factor
"rewards" the winning jth neuron by making the modified synaptic vector $m_j$ resemble the random sample
x at least as much as did the unmodified synaptic vector $m_j$. A negative scaling factor "punishes" the jth
neuron by making the modified synaptic vector $m_j$ disresemble x more than did the unmodified synaptic
vector. Geometrically, negative scaling tends to move misclassifying synaptic vectors in $R^n$ out of regions
of misclassification.
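
The reward and punish cases can be shown in a short Python sketch. It assumes a scaling factor of magnitude $c_t$ with $0 < c_t < 1$ whose sign is set by whether the winner classified the sample correctly; the function name and the numbers are hypothetical.

```python
def scl_update(m_j, x, c_t, correct):
    """Move the winning vector toward x if it classified x correctly, else away."""
    sign = 1.0 if correct else -1.0
    return [mi + sign * c_t * (xi - mi) for mi, xi in zip(m_j, x)]

m_j = [0.0, 0.0]
x = [1.0, 0.0]
rewarded = scl_update(m_j, x, c_t=0.25, correct=True)    # moves toward x
punished = scl_update(m_j, x, c_t=0.25, correct=False)   # moves away from x
```

The rewarded vector ends up closer to x than before, and the punished vector ends up farther away, matching the geometric description above.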

Competitive learning systems can be viewed as non-neural signal processing algorithms. As in a
correlation detection system, random inputs are metrically compared with the columns of the matrix M.
Pattern recognition (signal detection) is nearest-neighbor template matching. During training (and the
system may always be training) at most one column of M is modified at a time. This lightens computation
and favors digital implementation.

Autoassociative [1] competitive learning systems have two layers or fields of neurons. The input data
field $F_X$ of n neurons passes a randomly sampled pattern vector x forward through an n-by-m matrix M
of synaptic values to m "competing" neurons in the competitive field $F_Y$. A symmetric m-by-m matrix
W of within-field synaptic values describes the competition in $F_Y$. W has a positive main diagonal (or
diagonal band) with negative or zero values elsewhere. In practice W has 1s down its main diagonal and
-1s elsewhere. Autoassociative competitive learning systems recognize patterns. More generally, they
estimate probability density functions p(x).

Heteroassociative competitive learning systems have three fields of neurons. The first and third fields
are input and output data fields for sampling the random vector association (x, z). The second or "hidden"
field contains the competing neurons. The first and third fields can be concatenated into a single field of
n + p neurons. Competitive learning proceeds as in the autoassociative case. Closest matches modify the
columns of the between-field matrices M and N of synaptic values with the same difference-based learning
law.

In practice heteroassociative competitive learning systems do not directly estimate an unknown joint
probability density function p(x, z). Instead they estimate a sampled continuous function $f : R^n \to R^p$
from a large number of noisy random vector samples $(x_i, z_i)$. Implicitly the functional pair $(x_i, f(x_i))$
belongs to a high-probability region of $R^n \times R^p$. These heteroassociative systems can be directly compared
with multilayer neural networks trained with the backpropagation algorithm and with the same training
data. The competitive systems tend to learn faster, but less accurately, at least for small dimensions n
and p.

Feedback autoassociative and heteroassociative competitive learning systems, such as random adaptive
bidirectional associative memories [16], have more complicated dynamics. These stochastic systems are
globally stable if within-field competition is symmetric and between-field synaptic values are symmetrizable.
Our  discussion  focuses  more  on  encoding  than  on  decoding  properties  of  competitive  learning  systems. 
So  the  discussion  is  limited  to  feedforward  autoassociative  competitive  learning  systems  for  estimating 
unknown  probability  density  functions  p(x).  Heteroassociative  extensions  are  immediate. 

Competitive  Learning  as  Adaptive  Vector  Quantization 

Competitive learning systems adaptively quantize the pattern space $R^n$. The synaptic vector $m_j$
represents, or "rounds off," the local region about $m_j$. Each synaptic vector $m_j$ is a quantization vector.
The competitive learning system learns as synaptic vectors $m_j$ change in response to randomly sampled
training data. Geometrically, learning occurs if and only if the synaptic vectors $m_j$ move in the pattern
space $R^n$.

Competitive learning adaptively distributes the m synaptic vectors $m_1, \ldots, m_m$ in $R^n$ to approximate
the unknown probability density function p(x) of the random pattern vector x. The patterns x are
continuously distributed in $R^n$. p(x) describes their distribution. Where the patterns x are dense or
sparse, the synaptic vectors $m_j$ tend to be dense or sparse. Different competitive learning, or adaptive
vector quantization (AVQ), schemes distribute the synaptic vectors in different ways.

If p(x) were known, learning would be unnecessary. Numerical techniques [3, 22] could directly determine
pattern clusters or classes, centroids, and class boundaries.

All observed patterns are realizations of a single random vector x. The random vector x can be
interpreted as n ordered scalar random variables: $x = (x_1, \ldots, x_n)$. This is heuristic but incomplete.

A random vector x is a function. It is a measurable function from a sample space to a vector space.
(This means [19] inverse images $x^{-1}(O)$ of measurable or open subsets O of the vector space are measurable
subsets of the sample space.) In the autoassociative case both spaces are $R^n$, so $x : R^n \to R^n$. The sigma
algebra of measurable sets is the Borel [1] sigma algebra $B(R^n)$, the topological sigma algebra generated [19]
from the open subsets of $R^n$. In practice the random vector $x : R^n \to R^n$ is just the identity function:
x(v) = v for all v in $R^n$.

The cumulative distribution function $P : B(R^n) \to [0, 1]$ maps open subsets of $R^n$ to numbers in
[0, 1] and is countably additive on countably infinite disjoint unions of subsets of $R^n$. p(x) characterizes
the "randomness" of the random vector x. P(O) is the integral of p(x) on the open set $O \subset R^n$. Since x is
the identity function on $R^n$, p(x) characterizes the occurrence probability of the observed pattern samples
or realizations used in training or recognition.

The stochastic pattern recognition framework is rigorously defined by specifying the probability space
$(R^n, B(R^n), P)$ and the pattern random vector function x. In the default case x is the identity function.

Pattern clusters or classes are subsets of $R^n$. Some pattern classes are more probable than others. The
pattern space $R^n$ is partitioned into k subsets or decision classes $D_1, \ldots, D_k$:

$$R^n = D_1 \cup \ldots \cup D_k \quad \text{and} \quad D_i \cap D_j = \emptyset \ \text{ if } i \neq j . \qquad (1)$$

The distinction between supervised and unsupervised pattern learning and recognition depends on
the available information. In both cases the probability density function p(x) is unknown. That is why
adaptive techniques are used instead of, say, numerical optimization or calculus-of-variations [3] techniques.

Supervised learning requires more information than unsupervised learning. Unsupervised learning uses
minimal information. Pattern learning is supervised if the decision classes $D_1, \ldots, D_k$ are known and the
learning system uses this information. The user knows, and the algorithm uses, the class membership
of every sample pattern x. The user knows that $x \in D_i$ and that $x \notin D_j$ for all $j \neq i$. Learning is
unsupervised if class memberships are unknown. Supervised learning systems allow an error measure or
vector to be computed. The simplest error measure is the desired outcome minus the actual outcome. The
error measure guides the learning process with feedback error correction.

AVQ  Class  Probability  Estimation 

The partition property (1) implies that $P(D_1) + \ldots + P(D_k) = 1$ since $P(R^n) = 1$. The class probability
$P(D_i)$ is given by

$$P(D_i) = \int_{D_i} p(x)\,dx \qquad (2)$$

$$= E[I_{D_i}] , \qquad (3)$$

where the integral in (2) is an n-dimensional multiple integral. E[x] is the expectation of random variable
x. The function $I_S : R^n \to \{0, 1\}$ is the indicator function of set S: $I_S(x) = 1$ if $x \in S$, $I_S(x) = 0$ if
$x \notin S$. In the probabilistic setting the indicator function $I_S$ is random (Borel measurable [1]), and hence a
random variable.

A pattern $x$ is in exactly one decision class with probability one. With probability zero, pattern $x$
can be on the border of two or more decision classes. The probability of any single point is zero: $p(\{x\}) = 0$ for every $x$ in $R^n$.

A uniform partition gives $p(D_i) = 1/k$ for each decision class $D_i$ in the partition. Uniform partitions
are clearly not unique. Some vector quantization schemes attempt to adaptively partition $R^n$ into a
uniform partition. Then it should be equally likely that a pattern sample $x$ drawn at random (according
to $p(x)$) from $R^n$ was drawn from any one of the $k$ decision classes $D_i$. This corresponds to each competing
neuron “winning” with the same frequency. Competitive learning has been modified [2,20] in several, usually
supervised, ways to force the competing neurons to have the same win rate. The motivation for such
modifications is economy: fewer neurons are needed to estimate a sampled continuous function.

Nonuniform partitions are more informative than uniform partitions. They also occur more frequently
when estimating an unknown probability density function $p(x)$. When the number of competing neurons
is less than the number of distinct pattern classes, when $m < k$, some neurons win more frequently than
others. If $p(D_i) > p(D_j)$, the competing neuron that codes for $D_i$ tends to win more frequently than the
neuron that codes for $D_j$. Equivalently, more sample patterns $x$ tend to be closer in Euclidean distance
to the corresponding synaptic vector, call it $m_i$, that quantizes $D_i$ than to the synaptic vector $m_j$ that
quantizes $D_j$. Below we show that $m_i$ and $m_j$ tend to arrive at the respective centroids of $D_i$ and $D_j$.
Centroids minimize the mean-squared error of vector quantization.

In general there are more competing neurons than decision classes, $m > k$, for neurons can always
be added to the competitive learning system. Then if $p(D_i) > p(D_j)$, there tend to be more synaptic
vectors within $D_i$ than within $D_j$. In principle all the neurons corresponding to the synaptic vectors in
$D_i$ can have the same win rates. But, since metrical classification is used to decide which neuron wins,
neurons with synaptic vectors nearer the centroid of $D_i$ tend to win more frequently.


The number of synaptic vectors in decision class $D_i$ gives a nonparametric estimate of the class probability
$p(D_i)$: $p(D_i) = n_i / m$, where $n_i$ is the number of synaptic vectors in $D_i$. In general the quantizing
synaptic vectors nonparametrically estimate the probability density function $p(x)$. No probability assumptions
need be made about the observed training samples. For any subset or volume $V \subset R^n$, the volume
probability $p(V)$ is estimated as the ratio

$$ p(V) = \frac{n_V}{m} , \qquad (4) $$

where $n_V$ is the number of synaptic vectors $m_j$ in $V$ and $m$ is the total number of synaptic vectors. In
the extreme case (4) gives $p(R^n) = 1$ and $p(\emptyset) = 0$.
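
The ratio estimate (4) amounts to counting synaptic vectors. The following is a minimal Python sketch rather than code from the report: the function name `estimate_volume_probability`, the membership test `in_volume`, and the five example vectors are hypothetical choices made for illustration.

```python
def estimate_volume_probability(synaptic_vectors, in_volume):
    """Estimate p(V) as n_V / m, the fraction of synaptic vectors inside V."""
    m = len(synaptic_vectors)
    if m == 0:
        raise ValueError("need at least one synaptic vector")
    n_v = sum(1 for vec in synaptic_vectors if in_volume(vec))
    return n_v / m


# Hypothetical example: five synaptic vectors in the plane.
vectors = [(0.1, 0.2), (0.4, 0.9), (0.7, 0.3), (1.5, 1.5), (-0.2, 0.5)]


def unit_square(v):
    """Membership test for the volume V = [0, 1] x [0, 1]."""
    return 0.0 <= v[0] <= 1.0 and 0.0 <= v[1] <= 1.0


if __name__ == "__main__":
    print(estimate_volume_probability(vectors, unit_square))
```

When the membership test accepts every vector the estimate is 1, and when it accepts none the estimate is 0, which matches the two extreme cases of (4).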


Deterministic  Competitive  Learning  Laws 


The  idea  behind  competitive  learning  is  learn  only  if  win.  Losing  neurons,  or  rather  their  synaptic 
fan-in  vectors,  do  not  learn.  They  also  do  not  forget  what  they  have  already  learned.  The  price  is  a 
nondistributed representation [5]. The synapses in a synaptic vector $m_j$ become in effect “grandmother”
synapses. Each synaptic element $m_{ij}$ is a discrete memory unit, as in a random access memory.

In contrast, classical Hebbian [6] or correlation learning distributes learned pattern-vector information
across the entire synaptic connection matrix $M$. But a Hebbian system forgets learned pattern information
as  it  learns  new  pattern  information. 

The simplest deterministic competitive learning [16,20] law is, in component-wise notation,

$$ \dot{m}_{ij} = S_j(y_j) \, [S_i(x_i) - m_{ij}] , \qquad (5) $$

where $\dot{m}_{ij}$ is the time derivative of the synaptic value of the directed axonal connection from the $i$th
neuron in the input field $F_X$ to the $j$th neuron in the output or competitive field $F_Y$. The n-by-m matrix
$M$ consists of the $m_{ij}$ values. The $j$th column of $M$ is the fan-in synaptic vector $m_j = (m_{1j}, \dots, m_{nj})$.
Scaling constants can be added or multiplied in (5) as desired. In contrast, the signal Hebbian learning [14]
law is

$$ \dot{m}_{ij} = -m_{ij} + S_i(x_i) \, S_j(y_j) . \qquad (6) $$

(5) and (6) differ in how they forget. All learning requires some forgetting. The competitive signal $S_j$ in
(5) nonlinearly scales the decay or forget term $-m_{ij}$. In practice [7,11,12,20] the competitive signal $S_j$ is a
zero-one or binary threshold function. Winners forget, losers remember.
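
To see how (5) and (6) forget differently, one can take a single Euler step of each law for a losing neuron ($S_j = 0$) and for a winning neuron ($S_j = 1$). This is an illustrative sketch only: the Euler discretization, the step size `dt`, and the function names are assumptions introduced here.

```python
def competitive_step(m_ij, s_i, s_j, dt=0.1):
    """One Euler step of law (5): dm/dt = S_j * (S_i - m)."""
    return m_ij + dt * s_j * (s_i - m_ij)


def hebbian_step(m_ij, s_i, s_j, dt=0.1):
    """One Euler step of law (6): dm/dt = -m + S_i * S_j."""
    return m_ij + dt * (-m_ij + s_i * s_j)


if __name__ == "__main__":
    m, s_i = 0.8, 0.3
    # Losing neuron: the competitive synapse is untouched, the Hebbian synapse decays.
    print(competitive_step(m, s_i, 0), hebbian_step(m, s_i, 0))
    # Winning neuron: the competitive synapse moves toward the input signal.
    print(competitive_step(m, s_i, 1))
```

Under these assumptions the losing synapse keeps its value under (5) but decays under (6), which is the sense in which losers remember.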

There are $n$ neurons in $F_X$ and $m$ competing neurons in $F_Y$. Each neuron in $F_X$ or $F_Y$ is a function
that transduces its real-valued activation $x_i(t)$ or $y_j(t)$ into a bounded signal $S_i(x_i(t))$ or $S_j(y_j(t))$ at time
$t$. In principle the activation functions, or membrane potential differences, $x_i$ and $y_j$ can be unbounded.

In feedback networks, the signal functions $S_i$ and $S_j$ are usually assumed bounded and monotone
nondecreasing. So their activation derivatives $S_i'$ and $S_j'$ are nonnegative. In practice logistic or hyperbolic-tangent
signal functions are often used. Then the signal functions $S_i$ and $S_j$ are strictly increasing and
hence their activation derivatives are positive:

$$ S_i' = \frac{dS_i}{dx_i} > 0 \quad \text{and} \quad S_j' = \frac{dS_j}{dy_j} > 0 . \qquad (7) $$


For instance, the logistic signal function $S(x) = (1 + e^{-cx})^{-1}$ with scale constant $c > 0$ has a positive
activation derivative $S' = c \, S \, (1 - S) > 0$. The logistic signal function rapidly approaches a binary
threshold function for increasing values of $c$.

In competitive learning the $F_Y$ signal functions $S_j$ are often binary threshold functions. $S_j(t) = 1$
if the $j$th competing neuron in $F_Y$ wins the competition for activation at time $t$. $S_j(t) = 0$ if the $j$th
neuron loses.
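
The logistic signal function and its activation derivative can be checked numerically. The sketch below is an illustration under stated assumptions: the function names and the particular values of `c` and `x` are hypothetical, and a central finite difference stands in for the analytic derivative.

```python
import math


def logistic(x, c=1.0):
    """Logistic signal function S(x) = 1 / (1 + exp(-c x))."""
    return 1.0 / (1.0 + math.exp(-c * x))


def logistic_derivative(x, c=1.0):
    """Activation derivative S'(x) = c S (1 - S)."""
    s = logistic(x, c)
    return c * s * (1.0 - s)


def finite_difference(x, c=1.0, h=1e-6):
    """Central-difference estimate of S'(x), used only as a cross-check."""
    return (logistic(x + h, c) - logistic(x - h, c)) / (2.0 * h)


if __name__ == "__main__":
    print(logistic_derivative(0.3, c=2.0), finite_difference(0.3, c=2.0))
    # A large scale constant makes the logistic function nearly a 0/1 threshold.
    print(logistic(-0.5, c=50.0), logistic(0.5, c=50.0))
```

For a large scale constant the two printed signal values sit very close to 0 and 1, the behavior of a binary threshold function.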

The $F_X$ signal functions $S_i$ are usually linear in feedforward systems: $S_i(x_i) = x_i$. Then the sample
pattern $x = (x_1, \dots, x_n)$ directly activates the system as the $F_X$ signal state vector $S_X(x) = x$. So in
practice the competitive learning law (5) becomes

$$ \dot{m}_{ij} = I_{D_j}(x) \, [x_i - m_{ij}] , \qquad (8) $$

where $I_{D_j}$ is the zero-one indicator function of decision class $D_j$. We assume synaptic vector $m_j$ codes for
class $D_j$, perhaps by hovering about the centroid of $D_j$.

Kohonen’s recent [11] supervised competitive learning (SCL) law is a reinforced version of (5):

$$ \dot{m}_{ij} = r_j(x) \, S_j \, [x_i - m_{ij}] , \qquad (9) $$

where $S_j$ is a binary threshold function determined metrically. $S_j = 1$ if $x$ is closer in Euclidean distance
to the synaptic vector $m_j$ than to all other synaptic vectors $m_i$. The new term $r_j$ in (9) is the reinforcement
function of the $j$th competing neuron in $F_Y$. $r_j$ rewards when $r_j(x) = 1$ and punishes when $r_j(x) = -1$.

The reinforcement function $r_j$ is determined by the class membership of the pattern sample $x$. So (9)
is a supervised competitive learning law. $r_j(x) = 1$ if $x \in D_j$ and the $j$th neuron wins or correctly
“classifies” $x$, that is, if $I_{D_j}(x) = S_j(x) = 1$. $r_j(x) = -1$ if the winning $j$th neuron misclassifies the sample
pattern $x$. Misclassification means the $j$th neuron wins but $x \in D_i$, or $x \notin D_j$, for some $i \neq j$. Then
$I_{D_j}(x) = 0$ but $I_{D_i}(x) = S_j = 1$. Since, with probability one, $x$ belongs to exactly one decision class,
the reinforcement function reduces to a difference of decision-class indicator functions:

$$ r_j = I_{D_j} - \sum_{i \neq j} I_{D_i} . \qquad (10) $$

(10) makes explicit the dependence of $r_j$ on the knowledge of the decision class boundaries.
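
The reinforcement function (10) can be written directly as a difference of indicator values. In this minimal sketch the integer class labels and the function name `reinforcement` are hypothetical, and the sample is assumed to belong to exactly one class.

```python
def reinforcement(winner_class, true_class, classes):
    """r_j = I_{D_j}(x) - sum over i != j of I_{D_i}(x), for x in true_class.

    winner_class is the class D_j coded by the winning neuron,
    true_class is the class that actually contains the sample x.
    """
    def indicator(c):
        return 1 if c == true_class else 0

    others = sum(indicator(c) for c in classes if c != winner_class)
    return indicator(winner_class) - others


if __name__ == "__main__":
    classes = [0, 1, 2]
    print(reinforcement(0, 0, classes))  # winner codes for the true class
    print(reinforcement(0, 2, classes))  # winner codes for a different class
```

Because exactly one indicator equals 1, the value is +1 for a correct classification and -1 for a misclassification.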

The unsupervised differential competitive learning [16] (DCL) law modulates the vector difference $x - m_j$
with the competitive win rate $\dot{S}_j$:

$$ \dot{m}_{ij} = \dot{S}_j(y_j) \, [S_i(x_i) - m_{ij}] , \qquad (11) $$

where the signal velocity $\dot{S}_j$ decomposes as $S_j' \, \dot{y}_j$ by the chain rule. The idea is learn only if change. The
signal velocity in (11) behaves in sign much as the reinforcement function in (9). The signal velocity $\dot{S}_j(t)$
is positive or negative according as the $j$th competing neuron’s winning status is increasing or decreasing
at time $t$. The signal velocity does not depend on the decision-class indicator functions. So the DCL law
(11) is unsupervised.

In practice the $F_X$ signal function $S_i$ is linear. Then simulations [12] show that the DCL law and
Kohonen’s  SCL  law  (9)  behave  similarly.  The  DCL  systems  tend  to  converge  to  decision  class  centroids  at 
least  as  fast  as  SCL  systems  do  and  tend  to  wander  about  the  centroids  with  less  variance.  The  competitive 
learning  laws  (5)  and  (9)  ignore  the  win-rate  information  provided  by  the  signal  velocity  in  (11). 

The pulse-coded [1,10] signal function $S_j$ is an exponentially weighted average of binary pulses:

$$ S_j(t) = \int_{-\infty}^{t} y_j(s) \, e^{s - t} \, ds , \qquad (12) $$

where the pulse function $y_j$ is defined by $y_j(t) = 1$ if a pulse is present at time $t$ and $y_j(t) = 0$ if no
pulse is present. Then the signal velocity is the simple, locally available, difference

$$ \dot{S}_j(t) = y_j(t) - S_j(t) . \qquad (13) $$

The velocity-difference representation (13) eliminates the need for an approximation algorithm to
calculate the signal velocity. Biological, or silicon, synapses can modify their values in real time with
signal velocity information. Biological neurons transmit and receive pulse trains, not real-valued sigmoidal
outputs. The presence or absence of a pulse is easier to detect, amplify, and emit than a multi-valued
signal. (13) shows that much of the time the arriving pulse $y_j(t)$ indicates the instantaneous sign of the
signal velocity.
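
A discretized version of the pulse-coded signal shows why the pulse minus the signal tracks the signal velocity. This is a hedged sketch: the Riemann-sum discretization, the step `dt`, the function names, and the on/off pulse train are all assumptions introduced for illustration, not the report's simulation.

```python
import math


def pulse_coded_signal(pulses, dt):
    """Riemann-sum form of (12): S(t_k) = sum over l <= k of y(t_l) exp(t_l - t_k) dt."""
    signal = []
    s = 0.0
    for y in pulses:
        # Recursive form of the weighted sum: decay the old average, add the new pulse.
        s = math.exp(-dt) * s + y * dt
        signal.append(s)
    return signal


def velocity_from_pulses(pulses, signal):
    """Velocity estimate y(t) - S(t), the pulse minus the current signal value."""
    return [y - s for y, s in zip(pulses, signal)]


if __name__ == "__main__":
    dt = 0.01
    pulses = [1] * 300 + [0] * 300   # hypothetical train: pulses on, then off
    signal = pulse_coded_signal(pulses, dt)
    velocity = velocity_from_pulses(pulses, signal)
    print(velocity[0], velocity[-1])
```

While pulses arrive the velocity estimate is positive, and once they stop it turns negative, so in this example the pulse itself indicates the sign of the signal velocity.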

The pulse-coded differential competitive law approximates [10] the classical competitive law (5), as can
be seen by substituting (13) into (11) and expanding terms. A related approximation of the signal Hebb
law (6) occurs when (13) eliminates a product of signal velocities in a comparable differential Hebbian
learning [9,10,13,15,16] law.

Stochastic  Competitive  Learning  Laws  and  Algorithms 


Stochastic  competitive  learning  laws  are  stochastic  differential  equations.  They  describe  how  synaptic 
random  processes  change  as  a  function  of  other  random  processes.  Their  solution  is  a  synaptic  random 
process [21].

The deterministic competitive learning laws (5), (9), and (11) are simple stochastic differential equations
if the signal terms $S_i(x_i(t))$ are random variables at each time $t$. This is so when the sample vectors
$x$ are random samples, realizations of the pattern random-vector process $x : R^n \to R^n$. The randomness
in the vector components $x_i$ induces randomness in the signal function $S_i$ and thus in the synaptic vectors
$m_j$. In general each term in a stochastic differential equation is a random process.

Another  simple  stochastic  differential  equation  arises  when  random  noise  is  added  to  a  differential 
equation.  The  randomness  in  the  noise  process  induces  randomness  in  the  dependent  variables.  In  general, 
and  in  this  discussion,  an  independent  noise  process  is  added  to  a  stochastic  differential  equation. 

The  stochastic  competitive  learning  law  is,  in  vector  notation, 

$$ dm_j = S_j(y_j) \, [S(x) - m_j] \, dt + dB_j , \qquad (14) $$

where $S_j$ is a steep competitive signal process taking values in $[0, 1]$ and $S(x) = (S_1(x_1), \dots, S_n(x_n))$ for
random pattern $x$. $B_j$ is a Brownian motion diffusion process.

The pseudo-derivative [21] of $B_j$ is the zero-mean white noise process $n_j$. The pseudo-derivative can be
formed as a mean-squared limit. The noise process $n_j$ is zero-mean, $E[n_j] = 0$, has finite variance, and
is independent of the “signal” term $S_j(y_j) [S(x) - m_j]$. Then competitive learning laws can be written
in less rigorous, more intuitive, “noise” notation. For example, (14) becomes

$$ \dot{m}_j = S_j(y_j) \, [S(x) - m_j] + n_j . \qquad (15) $$

In practice $S_j$ is a binary threshold function and can often be replaced with the class indicator function
$I_{D_j}$. The $F_X$ signal processes $S_i$ are linear. So (15) becomes

$$ \dot{m}_j = I_{D_j}(x) \, [x - m_j] + n_j . \qquad (16) $$

The stochastic version of Kohonen’s supervised competitive learning (SCL) law is

$$ \dot{m}_j = r_j(x) \, S_j(y_j) \, [x - m_j] + n_j . \qquad (17) $$

The stochastic version of the differential competitive learning (DCL) law is

$$ \dot{m}_j = \dot{S}_j(y_j) \, [x - m_j] + n_j , \qquad (18) $$

or,  in  pulse-coded  form, 


$$ \dot{m}_j = [y_j(t) - S_j(t)] \, [x - m_j] + n_j , \qquad (19) $$

where the pulse process $y_j$ is a random point process, perhaps Poisson in nature. (19) is thus a doubly
stochastic synaptic model.

For  practical  implementation  these  three  stochastic  competitive  learning  models  can  be  written  as 
stochastic  difference  equations  by  replacing  the  third  step  in  the  following  competitive  AVQ  algorithm. 
(Historical note: Tsypkin [22] derived the “winning” parts of the UCL algorithm and, with his adaptive
Bayes approach, the SCL algorithm in a non-neural context.) A random noise term has not been added
to the difference equations. The noise processes in the above models can represent unmodeled effects,
roundoff errors, or sample-size defects.

Competitive  AVQ  Algorithms 

1. Initialize synaptic vectors: $m_i(0) = x(i)$, $i = 1, \dots, m$.

2. For random sample $x(t)$, find the closest (“winning”) synaptic vector $m_j(t)$:

$$ \| m_j(t) - x(t) \| = \min_i \| m_i(t) - x(t) \| , \qquad (20) $$

where $\|x\|^2 = x_1^2 + \dots + x_n^2$ is the squared Euclidean norm of $x$.

3. Update the winning synaptic vector(s) $m_j(t)$ by the UCL, SCL, or DCL learning algorithm.

Unsupervised  Competitive  Learning  (UCL) 

$$ m_j(t+1) = m_j(t) + c_t \, [x(t) - m_j(t)] , \qquad (21) $$

$$ m_i(t+1) = m_i(t) \quad \text{if } i \neq j , $$

where, in the spirit of stochastic approximation [22], $c_t$ is a slowly decreasing sequence of learning coefficients.
For instance, $c_t = 0.1 \, (1 - t/10{,}000)$ for 10,000 samples $x(t)$.

Supervised  Competitive  Learning  (SCL) 

$$ m_j(t+1) = m_j(t) + c_t \, r_j(x(t)) \, [x(t) - m_j(t)] \qquad (22) $$

$$ = \begin{cases} m_j(t) + c_t \, [x(t) - m_j(t)] & \text{if } x \text{ is correctly classified,} \\ m_j(t) - c_t \, [x(t) - m_j(t)] & \text{if } x \text{ is misclassified.} \end{cases} \qquad (23) $$

Differential Competitive Learning (DCL)

$$ m_j(t+1) = m_j(t) + c_t \, \Delta S_j(y_j(t)) \, [x(t) - m_j(t)] , \qquad (24) $$

$$ m_i(t+1) = m_i(t) \quad \text{if } i \neq j , $$

where $\Delta S_j(y_j(t))$ is the time change of the $j$th neuron’s competitive signal $S_j(y_j)$ in the competition
field $F_Y$:

$$ \Delta S_j(y_j(t)) = S_j(y_j(t+1)) - S_j(y_j(t)) . \qquad (25) $$

In practice [12] only the sign of the difference (25) may be used. The $F_Y$ neuronal activations $y_j$ can be
updated by an additive model:

$$ y_j(t+1) = y_j(t) + \sum_{i=1}^{n} S_i(x_i) \, m_{ij}(t) + \sum_{k=1}^{m} S_k(y_k) \, w_{kj} . \qquad (26) $$

The fixed competition matrix $W$ defines a symmetric lateral inhibition topology within $F_Y$. In the simplest
case, $w_{jj} = 1$ and $w_{ij} = -1$ for distinct $i$ and $j$.
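
The three algorithm steps, with the UCL update (21) and the SCL update (22)-(23), can be sketched in a few lines of Python. This is an illustration under stated assumptions and not the simulation code behind the reported results: all function names, the `neuron_class` bookkeeping for SCL, and the two-cluster test data are hypothetical. The DCL update (24) is left out because it also needs the activation dynamics (26).

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def winner_index(vectors, x):
    """Step 2: index of the synaptic vector closest to x, as in (20)."""
    return min(range(len(vectors)), key=lambda i: squared_distance(vectors[i], x))


def learning_rate(t, total):
    """Decreasing coefficient c_t = 0.1 * (1 - t / total)."""
    return 0.1 * (1.0 - t / total)


def avq_train(samples, m, rule="ucl", labels=None, neuron_class=None):
    """Train m synaptic vectors with the UCL update (21) or the SCL update (22)-(23)."""
    vectors = [list(samples[i]) for i in range(m)]  # step 1: m_i(0) = x(i)
    total = len(samples)
    for t, x in enumerate(samples):
        j = winner_index(vectors, x)
        c_t = learning_rate(t, total)
        if rule == "ucl":
            scale = 1.0
        elif rule == "scl":
            # r_j = +1 for a correct classification, -1 for a misclassification
            scale = 1.0 if neuron_class[j] == labels[t] else -1.0
        else:
            raise ValueError("rule must be 'ucl' or 'scl'")
        # Step 3: only the winner moves; losing vectors are left unchanged.
        vectors[j] = [mj + c_t * scale * (xi - mj) for mj, xi in zip(vectors[j], x)]
    return vectors


def two_cluster_samples(count, seed=0):
    """Hypothetical test data: alternating samples around (0, 0) and (5, 5)."""
    rng = random.Random(seed)
    centers = [(0.0, 0.0), (5.0, 5.0)]
    samples, labels = [], []
    for t in range(count):
        cx, cy = centers[t % 2]
        samples.append((cx + rng.gauss(0.0, 0.3), cy + rng.gauss(0.0, 0.3)))
        labels.append(t % 2)
    return samples, labels


if __name__ == "__main__":
    data, data_labels = two_cluster_samples(2000)
    print(avq_train(data, 2, rule="ucl"))
```

On well-separated clusters like these the winner is never misclassified, so the SCL branch performs the same updates as UCL; the two rules differ only when a winning vector codes for the wrong class.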

Stochastic  Equilibrium  and  Convergence 


Competitive synaptic vectors $m_j$ converge to decision class centroids. The centroids may be local
maxima of the sampled but unknown probability density function $p(x)$.

In  general,  when  there  are  more  synaptic  vectors  than  probability  maxima,  the  synaptic  vectors  cluster 
about  local  probability  maxima.  Comparatively  few  synaptic  vectors  may  actually  arrive  at  centroids.  We 
only  consider  convergence  to  centroids.  The  justification  is  that  any  local  connected  patch  of  the  sample 
space $R^n$ can be viewed as a candidate decision class. Each synaptic vector samples such a local patch
and  converges  to  its  centroid. 

We  first  prove  the  AVQ  Centroid  Theorem:  If  a  competitive  AVQ  system  converges,  it  converges  to 
the  centroid  of  the  sampled  decision  class.  The  AVQ  Centroid  Theorem  is  an  equilibrium  or  steady-state 
result.  We  prove  the  theorem  only  for  unsupervised  competitive  learning,  but  argue  that  it  holds  for 
supervised  and  differential  competitive  learning  in  most  cases  of  practical  interest. 

Next  we  use  a  Lyapunov  argument  to  reprove  and  extend  the  AVQ  Centroid  Theorem  to  the  AVQ 
Convergence  Theorem:  Stochastic  competitive  learning  systems  are  asymptotically  stable,  and  synaptic 
vectors  converge  to  centroids.  So  competitive  AVQ  systems  always  converge,  and  converge  exponentially 
fast.  Both  results  are  true  with  probability  one. 

The  unknown  probability  density  function  p(x)  defines  the  class  centroids,  the  mean-squared  optimal 
vectors of quantization. Competitive learning estimates these optimal quantization vectors without knowledge
of $p(x)$. That is the advantage of competitive learning, and optimal learning in general.

AVQ  Centroid  Theorem: 


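$$ \mathrm{Prob}(m_j = \bar{x}_j) = 1 \quad \text{at equilibrium} . \qquad (27) $$

The centroid $\bar{x}_j$ of decision class $D_j$ is defined by

$$ \bar{x}_j = \frac{\int_{D_j} x \, p(x) \, dx}{\int_{D_j} p(x) \, dx} \qquad (28) $$

$$ = E[x \mid x \in D_j] . \qquad (29) $$

The random vector $E[x \mid \cdot]$, the conditional expectation, is a function of Borel measurable subsets $D_j$ of $R^n$.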


Proof. Suppose the $j$th neuron in $F_Y$ wins the activation competition during the training interval.
Suppose the $j$th synaptic vector $m_j$ codes for decision class $D_j$. So $I_{D_j}(x) = 1$ iff $S_j = 1$. Suppose
stochastic equilibrium has been reached:

$$ \dot{m}_j = 0 , \qquad (30) $$

which holds with probability one (or in the mean-square sense, depending on how the stochastic differentials
are defined). Take expectations of both sides of (30), use the zero-mean property of the noise process,
eliminate the synaptic velocity vector $\dot{m}_j$ with the competitive law (16), and expand to give

$$ 0 = E[\dot{m}_j] \qquad (31) $$

$$ = \int_{R^n} I_{D_j}(x) \, (x - m_j) \, p(x) \, dx + E[n_j] \qquad (32) $$

$$ = \int_{D_j} (x - m_j) \, p(x) \, dx \qquad (33) $$

$$ = \int_{D_j} x \, p(x) \, dx - m_j \int_{D_j} p(x) \, dx , \qquad (34) $$

since $m_j$ is constant with probability one. Solving for the equilibrium synaptic vector $m_j$ gives the centroid
$\bar{x}_j$ defined by (28). Q.E.D.

In general the AVQ Centroid Theorem concludes that the average synaptic vector $E[m_j]$ equals the
$j$th centroid $\bar{x}_j$ at equilibrium:

$$ E[m_j] = \bar{x}_j . \qquad (35) $$

The equilibrium synaptic vector $m_j$ vibrates randomly around the constant centroid $\bar{x}_j$. $m_j$ equals $\bar{x}_j$
on average at each post-equilibration instant. Simulations [12] exhibit such random wandering about the
centroid.
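
The random wandering can be reproduced in a one-dimensional toy model. The sketch below is only an illustration: it assumes an Euler discretization of the noisy law (16) with a constant step size, a single hypothetical decision class equal to the interval [0, 2] with uniform density (so the centroid is 1), and Gaussian additive noise. None of these choices comes from the report.

```python
import random


def simulate_noisy_ucl(steps=20000, step_size=0.05, noise_std=0.1, seed=0):
    """Euler steps of dm/dt = (x - m) + n for one always-winning synaptic value."""
    rng = random.Random(seed)
    m = 0.0
    trajectory = []
    for _ in range(steps):
        x = rng.uniform(0.0, 2.0)           # sample from the assumed class [0, 2]
        noise = rng.gauss(0.0, noise_std)   # zero-mean additive noise n_j
        m += step_size * ((x - m) + noise)
        trajectory.append(m)
    return trajectory


def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


if __name__ == "__main__":
    tail = simulate_noisy_ucl()[5000:]      # discard the transient
    spread = mean([(v - mean(tail)) ** 2 for v in tail]) ** 0.5
    print(mean(tail), spread)
```

In this toy setting the synaptic value never settles; it keeps fluctuating, while its time average stays close to the centroid, in line with (35).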

Synaptic  vectors  learn  noise  as  well  as  signal.  So  they  vibrate  at  equilibrium.  The  independent  additive 
noise process $n_j$ in (16) drives the random vibration. The steady-state condition (30) models the rare
event  that  noise  cancels  signal.  In  general  it  models  stochastic  equilibrium  in  the  absence  of  additive  noise. 

The  general  stochastic  steady-state  condition  is  defined  by  the  stochastic  differential  equation 

$$ \dot{m}_j = n_j . \qquad (36) $$

Taking expectations of both sides of (36) still gives (31), since the noise process $n_j$ is zero-mean, and
the argument proceeds as before. Taking a second expectation in (33) and using (31) gives (35).

The AVQ Centroid Theorem applies to the stochastic SCL law (17) because winners are picked
metrically by the nearest-neighbor criterion (20). The reinforcement function $r_j$ in (10) reduces to
$r_j(x) = -I_{D_i}(x) = -1$ if the $j$th neuron continually wins for random samples $x$ from class $D_i$.
This tends to occur once the synaptic vectors have spread out in $R^n$ and $D_i$ is close, usually contiguous,
to $D_j$. Then $m_j$ converges to $\bar{x}_i$, the centroid of $D_i$, since the steady state condition (30) removes the
scaling constant $-1$ that then appears in (33).

This argument holds only approximately when, in the exceptional case, $m_j$ repeatedly misclassifies
patterns $x$ from several classes $D_k$. Then the difference of indicator functions in (10) replaces the single
indicator function $I_{D_j}$ in (32). The resultant equilibrium $m_j$ is a more general ratio than the centroid.
The density $p(x)$ must be integrated over $R^n$, not just $D_j$.

The AVQ Centroid Theorem applies similarly to the stochastic DCL law (18). A positive or negative
factor scales the difference $x - m_j$. If, as in practice and in (24), a constant approximates the scaling
factor, the steady state condition (30) removes the constant from (33) and $m_j$ estimates the centroid $\bar{x}_j$.

The integrals in (31)-(34) are spatial integrals over $R^n$ or subsets of $R^n$. Yet in the discrete UCL,
SCL, and DCL algorithms, the recursive equations for $m_j(t+1)$ define temporal integrals over the training
interval. 

The two integrals are approximately equal. The discrete random samples $x(0), x(1), x(2), \dots$ partially
enumerate the continuous distribution of equilibrium realizations of the random vector $x$. The time
index in the discrete algorithms approximates the “spatial index” underlying $p(x)$. So the recursion
$m_j(t+1) = m_j(t) + \dots$ approximates the averaging integral. We sample patterns one at a time. We
integrate them all at a time.
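
A simple special case makes the link between the temporal recursion and the averaging integral concrete. If the learning coefficient is taken to be $c_t = 1/(t+1)$, an assumption made here for illustration and different from the example schedule given with (21), the recursion reproduces the sample mean exactly. The function name below is hypothetical.

```python
def recursive_mean(samples):
    """Run m(t+1) = m(t) + c_t [x(t) - m(t)] with the assumed choice c_t = 1 / (t + 1)."""
    m = samples[0]
    for t, x in enumerate(samples[1:], start=1):
        m = m + (x - m) / (t + 1)
    return m


if __name__ == "__main__":
    data = [2.0, 4.0, 9.0, 1.0]
    print(recursive_mean(data), sum(data) / len(data))
```

Both printed values are the arithmetic mean of the samples, so with this schedule the one-at-a-time recursion and the all-at-once average coincide.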

The  AVQ  Centroid  Theorem  assumes  that  stochastic  convergence  occurs.  Convergence  is  trivial  for 
continuous deterministic competitive learning, at least in feedforward networks. If $S_j$ is a positive constant
in (5), then $m_{ij}$ converges to $S_i$ exponentially fast. Convergence is not trivial for stochastic competitive
learning  in  noise. 
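
The exponential convergence in the deterministic case follows from solving (5) with constant signals: if $S_j = c > 0$ and $S_i$ is held fixed, then $m_{ij}(t) = S_i + (m_{ij}(0) - S_i) e^{-ct}$. The sketch below compares this closed form with a plain Euler integration; the function names, the step count, and the numerical values are hypothetical.

```python
import math


def closed_form(m0, s_i, c, t):
    """Solution of dm/dt = c (S_i - m) with m(0) = m0."""
    return s_i + (m0 - s_i) * math.exp(-c * t)


def euler_solution(m0, s_i, c, t, steps=10000):
    """Euler integration of dm/dt = c (S_i - m) from time 0 to time t."""
    dt = t / steps
    m = m0
    for _ in range(steps):
        m += dt * c * (s_i - m)
    return m


if __name__ == "__main__":
    print(closed_form(0.0, 1.0, 2.0, 3.0), euler_solution(0.0, 1.0, 2.0, 3.0))
```

The gap to the target value $S_i$ shrinks by the factor $e^{-ct}$, which is what exponential convergence means here.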

The  AVQ  Convergence  Theorem  ensures  exponential  convergence.  The  theorem  does  not  depend  on 
how the $F_Y$ neurons change in time. In effect metrical classification is assumed: $S_j = 1$ iff $I_{D_j}(x) = 1$.
The strictly decreasing deterministic Lyapunov function $E[L]$ replaces [16] the random Lyapunov function
$L : C \to R$, where $C$ is a closed and bounded (compact) subset of $R^m$.

A strictly decreasing Lyapunov function yields asymptotic stability [17]. Then the real parts of the
eigenvalues of the system Jacobian matrix are strictly negative, and locally the nonlinear system behaves
linearly. Synaptic vectors converge [6] exponentially quickly to equilibrium points, the pattern-class
centroids, in the state space. Technically, nondegenerate Hessian matrix conditions must be assumed.
Else  some  eigenvalues  can  have  zero  real  parts. 

AVQ Convergence Theorem: Competitive synaptic vectors converge exponentially quickly to pattern-class centroids.


Proof. Consider the random quadratic form $L$:

$$ L = \frac{1}{2} \sum_{i}^{n} \sum_{j}^{m} (x_i - m_{ij})^2 . \qquad (37) $$

Note that if $x = \bar{x}_j$ in (37), then with probability one $L > 0$ if any $m_j \neq \bar{x}_j$ and $L = 0$ iff $m_j = \bar{x}_j$
for every $m_j$.

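The pattern vectors $x$ do not change in time. (The following argument is still valid if the pattern
vectors $x$ change slowly relative to synaptic changes, that is, if the density $p(x)$ is mildly nonstationary.) This
simplifies the stochastic derivative of $L$:

$$ \dot{L} = \sum_i \frac{\partial L}{\partial x_i} \, \dot{x}_i + \sum_i \sum_j \frac{\partial L}{\partial m_{ij}} \, \dot{m}_{ij} \qquad (38) $$

$$ = \sum_i \sum_j \frac{\partial L}{\partial m_{ij}} \, \dot{m}_{ij} \qquad (39) $$

$$ = - \sum_i \sum_j (x_i - m_{ij}) \, \dot{m}_{ij} \qquad (40) $$

$$ = - \sum_i \sum_j I_{D_j}(x) \, (x_i - m_{ij})^2 - \sum_i \sum_j (x_i - m_{ij}) \, n_{ij} . \qquad (41) $$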

$L$ is a random variable at every time $t$. $E[L]$ is a deterministic number at every $t$. The trick is to use the
average $E[L]$ as a Lyapunov function for the stochastic competitive dynamical system. For this we must
assume sufficient smoothness to interchange the time derivative and the probabilistic integral, to bring
the time derivative “inside” the integral. Then the zero-mean noise assumption, and the independence of
the noise process $n_j$ with the “signal” process $x - m_j$, gives

$$ \dot{E}[L] = E[\dot{L}] \qquad (42) $$

$$ = - \sum_j \int_{D_j} \sum_i (x_i - m_{ij})^2 \, p(x) \, dx . \qquad (43) $$

So, on average by the learning law (16), $\dot{E}[L] < 0$ iff any synaptic vector $m_j$ moves along its trajectory.
So the competitive AVQ system is asymptotically stable [8,17] and, in general, converges exponentially quickly
to equilibria.

Suppose $\dot{E}[L] = 0$. Every synaptic vector has reached equilibrium and is constant (with probability
one). Then [19], since $p(x)$ is a nonnegative weight function, the weighted integral of the learning differences
$x_i - m_{ij}$ must also be zero:

$$ \int_{D_j} (x - m_j) \, p(x) \, dx = 0 \qquad (44) $$

in vector notation. (44) is identical to (33). So, with probability one, equilibrium synaptic vectors are
centroids. More generally, as discussed above, (35) holds. Average equilibrium synaptic vectors are centroids:
$E[m_j] = \bar{x}_j$. Q.E.D.


The sum of integrals (43) defines the total mean-squared error of vector quantization for the partition
$D_1, \dots, D_k$. The vector integral in (44) is, up to sign, the gradient of $E[L]$ with respect to $m_j$. So the AVQ Convergence
Theorem implies that class centroids, and asymptotically competitive synaptic vectors, minimize the
mean-squared error of vector quantization.
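
The minimizing property of the centroid can be checked on a small finite sample, where the class centroid is the sample mean. This is a minimal sketch with hypothetical function names and made-up data points, using an empirical average in place of the integral over $D_j$.

```python
def mean_squared_error(samples, m):
    """Average squared Euclidean distance from the samples to the vector m."""
    total = sum(sum((xi - mi) ** 2 for xi, mi in zip(x, m)) for x in samples)
    return total / len(samples)


def centroid(samples):
    """Componentwise sample mean, the empirical centroid of the class."""
    n = len(samples[0])
    return [sum(x[i] for x in samples) / len(samples) for i in range(n)]


if __name__ == "__main__":
    data = [(0.0, 0.0), (2.0, 0.0), (1.0, 3.0)]
    c = centroid(data)
    print(c, mean_squared_error(data, c))
    # Any displaced quantization vector gives a larger error.
    print(mean_squared_error(data, [c[0] + 0.1, c[1]]))
```

Moving the quantization vector away from the centroid in any direction raises the error, consistent with the gradient vanishing at the centroid.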

Then by (16), the synaptic vectors perform stochastic gradient descent on the mean-squared-error surface
in the pattern-plus-error space $R^{n+1}$. The difference $x(t) - m_j(t)$ behaves as an error vector. The
competitive system estimates the unknown centroid $\bar{x}_j$ as $x(t)$ at each time $t$. Learning is unsupervised
but proceeds as if it were supervised.